Message from the Program Co-Chairs 


Welcome to the 7th USENIX Conference on File and Storage Technologies! It has been our honor and pleasure to 
work with an outstanding program committee and the exceptional USENIX staff to bring this program to you. 


File and storage technologies continue to be a critical component in our computing landscape—the 102 submis- 
simple “thank you” from the program co-chairs seems hardly enough; if you see these people at the conference or 
encounter them elsewhere, please acknowledge the contribution they made to producing this conference! 


We also owe a debt of gratitude to the authors of all the papers submitted. Each paper represented a significant 
piece of work, and we could have produced multiple, interesting conferences had that been our charter. 


This conference could not happen without the outstanding USENIX staff. They are always there to do what needs 
to get done, nag us into doing what we need to do, solve problems, answer questions, etc. And they always do it 
with good will and cheer. 


And finally, after running conferences for over fifteen years, we were thrilled to have a conference management 
system that really did what it needed to do. We give a big thank you to Eddie Kohler for his HotCRP system—it’s 
easy to use and has every feature we needed, and Eddie was more responsive than any commercial customer ser- 
vice organization we’d ever encountered. Many, many thanks! 


We hope you enjoy the conference. 


Ric Wheeler, Red Hat 
Margo Seltzer, Harvard University 
FAST ’09 Program Co-Chairs 
The Case of the Fake Picasso: Preventing History Forgery 
with Secure Provenance 


Ragib Hasan 
rhasan@illinois.edu 
University of Illinois 
Abstract 


As increasing amounts of valuable information are pro- 
duced and persist digitally, the ability to determine the 
origin of data becomes important. In science, medicine, 
commerce, and government, data provenance tracking 
is essential for rights protection, regulatory compliance, 
management of intelligence and medical data, and au- 
thentication of information as it flows through workplace 
tasks. In this paper, we show how to provide strong 
integrity and confidentiality assurances for data prove- 
nance information. We describe our provenance-aware 
system prototype that implements provenance tracking 
of data writes at the application layer, which makes it 
extremely easy to deploy. We present empirical results 
showing that, for typical real-life workloads, the runtime 
overhead of our approach to recording provenance 
with confidentiality and integrity guarantees ranges from 
1% to 13%. 


1 Introduction 


Provenance information summarizes the history of the 
ownership of items and the actions performed on them. 
For example, scientists need to keep track of data cre- 
ation, ownership, and processing workflow to ensure a 
certain level of trust in their experimental results. The 
National Geographic Society’s Genographic Project and 
the DNA Shoah project (for Holocaust survivors search- 
ing for remains of their dead relatives) both track the pro- 
cessing of DNA samples. Individuals who submit DNA 
samples for testing through these programs want strong 
assurances that no unauthorized parties will be able to 
see the provenance of the samples (e.g., provide it to in- 
surance companies or anti-Semitic organizations). 
Regulatory and legal considerations mandate other 
provenance assurances. The US Sarbanes-Oxley Act 
[56] sets prison terms for officers of companies that is- 
sue incorrect financial statements. As a result, offi- 
cers have become very interested in tracking the path 
that a financial report took during its development, including 
both input data origins and authors. The US 
Gramm-Leach-Bliley Act [40] and Securities and Ex- 
change Commission rule 17a [55] also require docu- 
mentation and audit trails for financial records, as do 
many non-financial compliance regulations. For exam- 
ple, the US Health Insurance Portability and Account- 
ability Act mandates logging of access and change histo- 
ries for medical records [13]. 

Provenance tracking of physical artifacts is relying in- 
creasingly on digital shipping, manufacturing, and lab- 
oratory records, often with high-stakes financial incen- 
tives to omit or alter entries. For example, pharmaceuti- 
cals’ provenance is carefully tracked as they move from 
the manufacturing laboratory through a long succession 
of middlemen to the consumer. Clinical trials of new 
medical devices and treatments involve detailed record- 
keeping, as does US FDA testing of proposed new food 
additives. 

To help manage the above processes, digital prove- 
nance mechanisms support the collection and persistence 
of information about the creation, access, and transfer 
of data. While significant research has been conducted 
on how to collect, store, and query provenance infor- 
mation, the associated integrity and privacy issues have 
not been explored. But without appropriate guarantees, 
as data crosses application and organization boundaries 
and passes through untrusted environments, its associ- 
ated provenance information becomes vulnerable to il- 
licit alteration and should not be trusted. 

For example, consider the repudiation incentives in the 
following real-life anonymized medical litigation sce- 
nario. Alice visited Dr. B for consultation. B referred her 
to Dr. Mallory for tests, and sent Alice’s medical records 
to Mallory, who failed to analyze the test results prop- 
erly, and provided incorrect information to B. B provided 
these reports along with other information to Dr. C, who 
treated Alice accordingly. When Alice subsequently suf- 
fered from health problems related to the incorrect diag- 
nosis, she sued B and C for malpractice. To establish 
Mallory’s liability for the misdiagnosis, B and C hired 
Audrey as an expert witness. Audrey used the prove- 
nance information in Alice’s medical records to establish 
the exact sequence of events, which in this case impli- 
cated Mallory. If Mallory had been innocent, B and C 
should not be able to collude and falsely implicate him. 
Similarly, if Mallory altered his faulty diagnosis in Al- 
ice’s medical records after the fact, Audrey should be 
able to detect that. 

Making provenance records trustworthy is challeng- 
ing. Ideally, we need to guarantee completeness — all 
relevant actions pertaining to a document are captured; 
integrity — adversaries cannot forge or alter provenance 
entries; availability — auditors can verify the integrity 
of provenance information; confidentiality — only autho- 
rized parties should read provenance records; and effi- 
ciency — provenance mechanisms should have low over- 
heads. 

In this paper, we propose and evaluate mechanisms 
for secure document provenance that address these prop- 
erties. In particular, our first contribution is a cross- 
platform, low-overhead architecture for capturing prove- 
nance information at the application layer. This archi- 
tecture captures the I/O write requests of all applications 
that are linked with the library, extracts the new data be- 
ing written and the identity of the application writing it, 
and appends that information to the provenance chain for 
the document being written. Further, the resulting prove- 
nance chain is secure in the sense that a particular entry in 
the chain can only be read by the auditors specifically au- 
thorized to read it, and no one can add or remove entries 
from the middle of the chain without detection. Our sec- 
ond contribution is an implementation of our approach 
for file systems, along with an experimental evaluation 
that shows that our approach introduces little overhead at 
run time, only 1%-13% for typical real-life workloads. 


2 Provenance Model 


In this section, we define basic provenance-related con- 
cepts and discuss deployment and threat models. 


2.1 Definitions and Usage Model 


A document is an abstraction for a data item for which 
provenance information is collected, such as a file, 
database tuple, or network packet. 

We define provenance of a document to be the record 
of actions taken on that particular document over its life- 
time. Note that this definition differs from the informa- 
tion flow provenance used in PASS [38] and some other 
systems. Each access to a document D may generate a 
provenance record P. The types of access that should 
generate a provenance record and the exact contents of 
the record are domain-specific, but in general P may 
include the identity of the accessing principal; a log of 
the access actions (e.g., read, write) and their associated 
data (e.g., bytes of D or its metadata read/written); 
a description of the environment when the action was 
made, such as the time of day and the software environment; 
and confidentiality- and integrity-related components, 
such as cryptographic signatures, checksums, 
and keying material. A provenance chain for document 
D is a non-empty time-ordered sequence of provenance 
records $P_1 | \cdots | P_n$. In real deployments, the chain is 
associated and transported together with a document D. 

In a given security domain, users are principals who 
read and write to the documents, and/or make changes 
to document metadata. In a given organization, there are 
one or more auditors, who are principals authorized to 
access and verify the integrity of provenance chains as- 
sociated with documents. Every user trusts a subset of 
the auditors. There can be an auditor who is trusted by 
everyone, and referred to as the superauditor. 

Documents and their associated provenance chains are 
stored locally on the current user’s machine. The local 
machines of the users are not trusted. Each user has com- 
plete control over the software and hardware of her local 
machine and storage. Documents can be transferred from 
one machine to another. A transfer of a document from 
one machine to another also causes the provenance chain 
to be transferred to the recipient. 

Adversaries are inside or outside principals with ac- 
cess to the chains, who want to make undetected changes 
to a provenance chain for personal benefit. We do not 
consider denial of service attacks such as the total re- 
moval of a provenance chain. 

We assume readers are familiar with semantically secure 
(IND-CPA) encryption and signature mechanisms 
[22] and cryptographic hashes [34]. We use ideal, 
collision-free hashes and strongly unforgeable signatures. 
We denote by $S_k(x)$ a public key signature with 
key $k$ on item $x$. Both $a|b$ and $a, b$ denote the 
concatenation of b after a. 


2.2 Threat Model 


In this paper, we focus on tracking document writes 
and securing the provenance information associated with 
them. We leave as future work the questions of how to 
ensure that provenance information is always collected, 
how to track document reads efficiently, and certain other 
technical issues discussed below. We now discuss the 
reasons for choosing this focus. 

A provenance tracking system implemented at a par- 
ticular level is oblivious to attacks that take place outside 
the view of that level. For example, suppose that we im- 
plement provenance tracking in the OS kernel. If the ker- 
nel is not running on hardware that offers special security 
guarantees, an intruder can take over the machine, sub- 
vert the kernel, and circumvent the provenance system. 
Thus, without a trusted pervasive hardware infrastructure 
and an implementation of provenance tracking 
at that level, we cannot prevent all potential attacks on 
provenance chains. Even in such an environment, a ma- 
licious user who can read a document can always memo- 
rize and replicate portions thereof later, minus the appro- 
priate provenance information. For example, an indus- 
trial spy can memorize technical material and reproduce 
it later on in verbatim or edited form. Overall, it is im- 
practical to assume that we have the ability to fully mon- 
itor the information flow channels available to attackers. 
Thus, our power to track the origin of data is limited. 


Fortunately, in many applications of provenance- 
aware systems, illicit document copying and/or complete 
removal of provenance chains are not significant threats. 
For example, in cattle tracking, we are not worried that 
someone will try to steal a (digital record of a) cow and 
try to pass it off as their own. Similarly, a cow with no 
provenance record at all is highly suspect, and many dif- 
ferent parties would have to collude to fabricate a con- 
vincing provenance history for that cow. Instead, the pri- 
mary concern is that a farmer might want to rewrite his- 
tory by omitting a record showing that a particular sick 
cow previously lived at his feed lot. As another appli- 
cation, a retail pharmacy will not accept a shipment of 
drugs unless it can be shown that the drugs have passed 
through the hands of certain middlemen. Thus, if an 
enterprising crook wants to sell drugs manufactured by 
an unlicensed company, he might want to forge a prove- 
nance chain that gives the drugs a more respectable his- 
tory, in order to move them into the supply chain. Simi- 
larly, there is little danger that someone will remove the 
provenance chain associated with a box of Prada acces- 
sories, and try to pass them off as another brand. In- 
stead, the incentive is to pass off non-Prada accessories 
as Prada. This is very hard to do, as a colluder with 
the ability to put the Prada signature on the accessories’ 
provenance chain needs to be found. Anyone who can do 
that signature is a Prada insider, and Prada insiders have 
little incentive to endorse fake merchandise. 


Of course, there are applications where there are sig- 
nificant incentives to track data reads and for malicious 
parties to sever documents from their original prove- 
nance chains. For example, one may track the flow of 
intelligence information so that it can be traced back to 
its original source. It is possible to track all reads and 
then use that to track information flows. However, do- 
ing that is expensive, especially in the long run as prove- 
nance chains get longer and longer. It is impossible to 
offer complete provenance assurances in such situations, 
but certain measures can be taken, such as digital water- 
marking and kernel-level tracking of all data read by a 
writing process [38] (which is expensive due to the need 
to log all data read). Overall, however, we believe that 


in the presence of any non-trivial threat model, tracking 
read operations for the purpose of provenance collection 
is impractical and necessarily insecure, because the ad- 
versary can always read data through unmonitored chan- 
nels and produce verbatim or edited copies. Thus, we 
will not consider the tracking of read operations further 
in this paper. 

The primary threat we guard against in this paper is 
undetected rewrites of history, which occur when ma- 
licious entities forge provenance chains to match illicit 
document writes and metadata modifications. Specifically, 
suppose that we have a provenance chain ([A], [B], 
[C], [D], [E], [F]), in which, for simplicity, each entry 
is denoted by the identity of its corresponding principal 
A, B, C, and so on. Then, we will provide the following 
integrity and confidentiality assurances: 


• I1: An adversary acting alone cannot selectively remove 
other principals’ entries from the start or the 
middle of the chain without being detected by the 
next audit. 

• I2: An adversary acting alone cannot add entries 
at the beginning or in the middle of the chain without 
being detected by the next audit. 

• I3: Two colluding adversaries cannot add entries 
of other non-colluding users “between” them in the 
chain without being detected by the next audit. 


For example, colluding users B and D cannot undetectably 
add entries between their own, corresponding to 
fabricated actions by a non-colluding party E. 


• I4: Once the chain contains subsequent entries 
by non-malicious parties, two colluding adversaries 
cannot selectively remove entries associated with 
other non-colluding users between them in the 
chain without being detected by the next audit. 


E.g., colluding users B and D cannot remove entries 
made by non-colluding user C. 

An adversary in possession of a document can always 
eliminate all elements in the chain, starting from the last 
colluding party’s entry in the chain. For example, a malicious 
F could remove the entry for E, if D cooperates, 
and claim that the chain is ([A], [B], [C], [D], [F]). This, 
however, is a denial-of-service attack that cannot be prevented 
through technical means only. Two parties could 
always collude in this manner in real life through out- 
side channels. Thus, we target chain forgeries that ma- 
liciously add new chain entries and make after-the-fact 
modifications. 


• I5: Users cannot repudiate chain records. 

• I6: An adversary cannot claim that a valid provenance 
chain for one document belongs to a different 
document (lineage forgery) without detection at the 
next audit by a superauditor, if not sooner. 

• I7: If the adversary alters a document without appending 
appropriate provenance records to its chain, 
this will be detected at the next audit by a superauditor, 
if not sooner. 

• C1: Any auditor can verify the integrity of the chain 
without requiring access to any of its confidential 
components. Unauthorized access to confidential 
provenance record fields is prevented. 

• C2: The set of parties originally authorized to read 
the contents of a particular provenance record for D 
can be further restricted by subsequent writers of D. 


For illustration purposes, consider Bob, Charlie, and 
Dave editing document D, in that order. The resulting 
provenance chain contains chronologically ordered entries 
made by these users. Suppose that Bob performed 
an operation on D using a proprietary algorithm, and 
does not want the workflow he used to be revealed to 
anyone except auditor Audrey. Property C1 ensures that 
Bob can selectively reveal the records pertaining to his 
actions on D to Audrey. Audrey can verify the integrity 
of the chain and read Bob’s action record, while other 
auditors can only verify the integrity of the chain. 

Suppose that Dave subsequently decides to release D 
to George, who should not learn the private informa- 
tion in Charlie’s provenance records. Property C2 allows 
Dave to give George the provenance chain minus Char- 
lie’s sensitive information, while still allowing George to 
verify the integrity of the chain. 


3 A Secure Provenance Scheme 


We propose a solution composed of several layered 
components: encryption for sensitive provenance chain 
record fields, a checksum-based approach for chain 
records and an incremental chained signature mechanism 
for securing the integrity of the chain as a whole. For 
confidentiality (C1), we deploy a special keying scheme 
based on broadcast encryption key management [24, 27] 
to selectively regulate the access for different auditors. 
Finally, for confidentiality (C2), we use a cryptographic 
commitment based construction. In the following, we 
detail these components. 


3.1 Building Blocks 
3.1.1 Chain Construction 


Provenance records (entries for short) are the basic units 
of a provenance chain. Each entry $P_i$ denotes a sequence 
of one or more actions performed by one principal on a 
document D: 


$P_i = (U_i, W_i, \mathrm{hash}(D_i), C_i, \mathit{public}_i, I_i)$, 


where 


• $U_i$ is an opaque or plaintext identifier for the principal; 
• $W_i$ is an opaque representation of the sequence of document 
modifications performed by $U_i$; 


7th USENIX Conference on File and Storage Technologies 


• $\mathrm{hash}(D_i)$ is a cryptographic hash of the newly modified 
contents of D; 

• $C_i$ contains an entry integrity checksum; 

• $\mathit{public}_i$ is an optional opaque or plaintext public key certificate 
for user $U_i$; 

• $I_i$ contains keying material for interpreting the preceding 
fields. 


As a practical matter, at the start of an editing session, 
the provenance system should verify that the current con- 
tents of D match its hash value stored in the most recent 
provenance record. 

We discuss each of these fields in the following sub- 
sections. 
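
As a rough illustration of the entry layout and the start-of-session check described above, the following Python sketch models an entry as a record with one attribute per field. The class name, attribute names, and the choice of SHA-256 are assumptions made for this example and are not taken from the prototype.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProvenanceEntry:
    """One entry P_i; attribute names are illustrative, not the prototype's."""
    user: bytes                   # U_i: opaque or plaintext principal identifier
    modifications: bytes          # W_i: opaque representation of the edits
    doc_hash: bytes               # hash(D_i): hash of the newly modified contents
    checksum: bytes               # C_i: entry integrity checksum
    public_cert: Optional[bytes]  # public_i: optional certificate
    keying: bytes                 # I_i: keying material for the other fields


def document_matches_chain(document: bytes, chain: List[ProvenanceEntry]) -> bool:
    """Check that the current contents match the hash in the newest entry."""
    if not chain:
        return False  # a provenance chain is non-empty by definition
    # SHA-256 is an assumed hash function for this sketch.
    return hashlib.sha256(document).digest() == chain[-1].doc_hash
```

A provenance-aware editor could run `document_matches_chain` before accepting the first write of a session and refuse to extend a chain whose newest hash does not match.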


3.1.2 Confidentiality 


Let $w_i$ be a representation of the sequence of document 
modifications just performed by $U_i$. The choice 
of representation for $w_i$ is dictated by the application 
domain; for example, $w_i$ could be a file diff, a log of 
changes, or a higher-level semantic representation of the 
alterations. The representation should be reversible if 
we want to allow auditors to check whether the current 
contents of D match its declared history; otherwise the 
representation does not have to be reversible. Given $w_i$, 
$W_i$ is an encrypted version of $w_i$: $W_i = E(w_i)$. 

In the remainder of this section, we discuss options for 
E that satisfy the C1 confidentiality requirement. 
Strawman Choices of E. If all auditors should be able 
to read all entries, we can encrypt all $w_i$ using a single secret 
key $k$ shared with the auditors via a central keystore, 
i.e., $E(w_i) = e_k(w_i)$. If only a subset of the auditors 
should be able to read $w_i$, then $U_i$ can encrypt one copy 
of $w_i$ for each auditor that $U_i$ trusts: 

$E(w_i) = \{ e_{K_a}(w_i) : K_a \text{ is the public key of } A, \text{ an auditor } U_i \text{ trusts} \}$ 

This is inefficient, as $U_i$ has to include multiple copies of 
$w_i$, which may be quite large. One solution approach is 
to let each auditor in the provenance system correspond 
to an auditor role in the larger environment, and use ma- 
chinery external to the provenance system to determine 
who can get access to the auditor role and to track the 
activities of auditors. Still, the number of different audi- 
tor roles can become quite large in a real-world institu- 
tion. For example, a medical record might be audited by 
lab technicians, billing people, the patient, her guardians, 
physicians, and insurers, each with different rights to see 
details of what was done to the record. Hence, the use of 
auditor roles external to the provenance system needs to 
be coupled with internal measures to minimize the size 
of entries. 

To save space, $U_i$ can encrypt $w_i$ with a session key 
$k_i$ unique to this entry, and also include the (shorter) session 
key $k_i$ encrypted with the public keys of the trusted 
auditors. That is, 

$E(w_i) = e_{k_i}(w_i)$ 
$I_i = \{ e_{K_a}(k_i) : K_a \text{ is the public key of } A, \text{ an auditor } U_i \text{ trusts} \}$ 

In the rest of this paper, we assume that each entry employs 
a session key. Still, $I_i$ may have to include many 
encrypted auditor keys. 
Broadcast Encryption for E. To reduce the number 
of keys that must be included in $P_i$, we employ broadcast 
encryption [24, 27]. We illustrate with a specific 
instance, but any broadcast encryption approach can be 
used here. 

Given a set of up to n auditors during the lifetime of D, 
we build a broadcast encryption tree of height $\lceil \log n \rceil$. 
Each leaf corresponds to one auditor, and each node contains 
a public/private keypair in a PKI infrastructure. We 
give all the public keys in the tree to every user and auditor. 
In addition, we give each auditor the private keys in 
all the nodes on the path from its own leaf to the root. 

To allow all auditors to access an entry, we encrypt 
the session key $k_i$ with the public key in the root of the 
tree and store it in $I_i$. Any auditor can then decrypt $W_i$ 
using the private key in the root node (known only to 
auditors). Similarly, if $U_i$ trusts a single auditor A, we 
encrypt $k_i$ with the public key $K_a$ at the leaf for A, so 
that $I_i = e_{K_a}(k_i)$. If $U_i$ trusts an arbitrary subset S of 
auditors, we set 


$I_i = \{ e_K(k_i) : K \text{ is the public key of a node in } R \}$, 


where R is a minimum-size set of tree nodes such that 
the set of descendants of R includes all the leaves for 
auditors in S, and no other leaves. This approach can 
significantly reduce the size of $I_i$, as $I_i$ includes only 
$\log(n - |S|)$ copies of $k_i$. 
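
The selection of the node set R can be sketched as a recursive walk over the tree. The sketch below assumes a complete binary tree whose number of leaves is a power of two, with nodes numbered heap-style (root 1, children 2j and 2j+1) and auditors identified by leaf positions 0 to n-1. The function name and the numbering are hypothetical conventions chosen for this example; any broadcast encryption scheme could define R differently.

```python
from typing import List, Set


def cover_nodes(num_leaves: int, trusted: Set[int]) -> List[int]:
    """Return heap indices of subtree roots whose leaves are exactly `trusted`.

    Assumes num_leaves is a power of two; leaf positions run from 0 to
    num_leaves - 1, and the root of the tree has heap index 1.
    """
    def visit(node: int, lo: int, hi: int) -> List[int]:
        # This node covers the leaf positions lo .. hi - 1.
        count = sum(1 for leaf in range(lo, hi) if leaf in trusted)
        if count == 0:
            return []        # no trusted auditor below this node
        if count == hi - lo:
            return [node]    # every leaf below is trusted: one key suffices
        mid = (lo + hi) // 2
        return visit(2 * node, lo, mid) + visit(2 * node + 1, mid, hi)

    return visit(1, 0, num_leaves)
```

With eight auditors, trusting all of them yields only the root, while trusting positions 0 to 3 and 6 yields two nodes: the left child of the root and the leaf for position 6. The session key would then be encrypted once per returned node.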

If $U_i$ should be opaque, the session key $k_i$ can be used 
to hide the identity U of the user responsible for the document 
modifications represented in entry $P_i$. U should 
identify the principal in a manner appropriate for the application 
domain. For example, in a file system we can 
define U as: 


U = (user ID, pid, port, ipaddr, host, time) 


In an application domain where $U_i$ should be opaque, we 
can set $U_i = e_{k_i}(U)$, where $k_i$ is the session key defined 
above. The same can be done for U’s public key certificate 
$\mathit{public}_i$. Authorized auditors can use $I_i$ to decrypt 
$U_i$ and $\mathit{public}_i$. 

In some application domains, we may need finer-grained 
control over which auditors and users can see 
which details of an entry, rather than using a single session 
key to encrypt all sensitive fields in an entry. For 
example, we might be willing to let the billing auditors 
decrypt $U_i$ but not $W_i$. In this case, one option is to use 
additional session keys for the other sensitive fields of 
the entry, so that we can control exactly which auditors 
can see which fields. 

Threshold Encryption. To address separation-of-duty 
concerns, we can partition the set of auditors into groups, 
so that decryption of $W_i$ requires joint input from at 
least one auditor from each group. Alternatively, we 
can require that at least $k$ different authorized auditors 
act jointly to decrypt $W_i$. For these two approaches, we 
employ secret sharing and threshold cryptography for $I_i$ 
[49]. Under the first approach, each group has a different 
share of the session key $k_i$, and we use the broadcast 
encryption keys to encrypt those shares. Each auditor 
can decrypt the share for her group. Under the second 
approach, there are as many different shares as auditors, 
and a minimum threshold number of auditors must collaborate 
to decrypt $W_i$. 
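
The first approach, in which one share per group is needed, can be illustrated with simple XOR-based secret splitting. This is only a sketch under stated assumptions: the function names are hypothetical, XOR splitting is one possible all-shares-required scheme rather than the one the paper necessarily uses, and the threshold variant (any sufficiently large subset of auditors) would need a scheme such as Shamir's secret sharing, which is not shown.

```python
import secrets
from typing import List


def split_key(session_key: bytes, num_groups: int) -> List[bytes]:
    """Split a session key into num_groups shares; all are needed to rebuild it."""
    if num_groups < 1:
        raise ValueError("need at least one group")
    # The first num_groups - 1 shares are uniformly random byte strings.
    shares = [secrets.token_bytes(len(session_key)) for _ in range(num_groups - 1)]
    last = session_key
    for share in shares:
        last = bytes(a ^ b for a, b in zip(last, share))
    # The final share is the key XORed with every random share.
    return shares + [last]


def combine_shares(shares: List[bytes]) -> bytes:
    """XOR all shares together to recover the session key."""
    key = bytes(len(shares[0]))
    for share in shares:
        key = bytes(a ^ b for a, b in zip(key, share))
    return key
```

Each share would then be encrypted under the broadcast encryption key of its group, so that one auditor per group must contribute before the session key can be recovered.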


3.1.3 Integrity 


Property C1 says that every auditor can verify every 
provenance chain, even if he or she cannot decrypt some 
of its $W_i$ fields. In some application domains, it is appropriate 
to allow every user to act as an auditor in this 
weak sense. In this situation, we can define the integrity 
checksum field $C_i$ of an entry as: 


$C_i = S_{U_i}(\mathrm{hash}(U_i, W_i, \mathrm{hash}(D_i), \mathit{public}_i, I_i) \,|\, C_{i-1})$, 


where $S_{U_i}$ means that user $U_i$ signs the hash with 
his or her private key. We refer to this approach as 
signature-based checksums, as it creates a signature 
chain that enforces the integrity assurances I1-I7 for 
each individual entry and for the chain itself. For a 
user Audrey to be able to verify the chain, she must be 
able to tell which user wrote each entry in the chain, 
and have access to their public keys. This can be 
accomplished by storing the $U_i$ and, if present, $\mathit{public}_i$ 
fields as plaintext; if the $\mathit{public}_i$ field is empty, Audrey 
must find the public key for $U_i$ through external means. 
Audrey can then verify the integrity of the provenance 
chain by parsing it from beginning to end and using the 
$C_i$ values to verify the integrity of each entry. She can 
also verify that the current contents $D_n$ of D match its 
hash in $P_n$. However, she cannot check that $D_n$ was 
computed properly from $D_{n-1}$ unless she is allowed 
to decrypt $W_n$, i.e., she is given access to the session 
key $k_n$ and a reversible representation was used for 
$w_n$. To verify that all of these transformations were 
performed correctly, Audrey must retrieve her keys from 
the broadcast encryption tree and use them to decrypt 
the $I_i$ field of each entry. The $I_i$ field holds the session 
key for each entry, which she can use to decrypt all of 
the entry’s fields, thus obtaining the $w_i$ fields. From $D_n$ 
and reversible $w_n$, Audrey can compute $D_{n-1}$ and verify 
that it matches its hash in $P_{n-1}$. Audrey can repeat this 
process with $w_{n-1}$, continuing until the entire evolution 
of D has been verified. If Audrey is not authorized to 
access all of the session keys for D, then she can only 
verify that the most recent $j$ entries match the contents 
of D, where session key $k_{n-j}$ is the most recent session 
key that she cannot access. 
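
The chaining of checksums can be sketched in a few lines of Python. To keep the example within the standard library, an HMAC with a per-user key stands in for the public key signature $S_{U_i}$; this is an assumption of the sketch, and unlike a real signature it requires the verifier to hold the signing key. The function names, the length-prefixed field encoding, and the empty value used in place of a predecessor for the first entry are likewise illustrative choices.

```python
import hashlib
import hmac
from typing import Dict, List, Tuple

Entry = Tuple[str, Tuple[bytes, ...], bytes]  # (user id, entry fields, C_i)


def entry_digest(fields: Tuple[bytes, ...]) -> bytes:
    """Hash the entry fields (U_i, W_i, hash(D_i), public_i, I_i)."""
    h = hashlib.sha256()
    for field in fields:
        # Length-prefix each field so that field boundaries are unambiguous.
        h.update(len(field).to_bytes(4, "big"))
        h.update(field)
    return h.digest()


def make_checksum(user_key: bytes, fields: Tuple[bytes, ...], prev: bytes) -> bytes:
    """Compute C_i over hash(fields) | C_{i-1}; HMAC stands in for S_{U_i}."""
    return hmac.new(user_key, entry_digest(fields) + prev, hashlib.sha256).digest()


def verify_chain(entries: List[Entry], keys: Dict[str, bytes]) -> bool:
    """Walk the chain from the first entry and recompute every checksum."""
    prev = b""  # assumed placeholder: the first entry has no predecessor
    for user, fields, checksum in entries:
        expected = make_checksum(keys[user], fields, prev)
        if not hmac.compare_digest(expected, checksum):
            return False
        prev = checksum
    return True
```

Because each checksum covers its predecessor, dropping or rewriting an entry in the middle invalidates the stored checksum of the entry that follows it, which is the behavior the integrity assurances rely on.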


3.1.4 Fine-Grained Control Over Confidentiality 


As mentioned earlier, the all-or-nothing approach to 
allowing auditors to view sensitive fields will be too 
coarse-grained for some applications. Sometimes it may 
be hard to foresee which fields may become sensitive 
over time, especially for a long-lived document that may 
cross boundaries between organizations. For example, 
the confidentiality needs for the testing of a particular 
National Geographic DNA sample may be met perfectly 
by a particular set of auditors and session keys, as long 
as the sample stays at its original processing location 
(the University of Arizona). However, a very small per- 
centage of samples produce ambiguous or seemingly un- 
likely results (rare genotypes), and these are sent for ad- 
ditional rounds of testing at other labs. When a sample’s 
chain is sent out to a lab in England, the details of previ- 
ous testing should be eliminated to prevent bias in inter- 
preting the results of the new rounds of tests. 

To provide flexibility in such situations without a proliferation 
of broadcast encryption keys, we can use cryptographic 
commitments [6] for subfield and field data 
that may eventually be deemed sensitive. With such a 
scheme, we can selectively omit plaintext data entirely 
when sending D’s chain to a new organization, regardless 
of whether such a need was foreseen when setting 
up the session key(s) for D. The plaintext information 
can be restored to the chain if, for example, D later 
finds its way back to its original organization. To achieve 
this level of control without a proliferation of encryption 
keys, we replace each potentially sensitive plaintext subfield 
s inside $U_i$ or $W_i$ by its commitment before computing 
the checksum for $P_i$: 


$\mathrm{comm}(s) = \mathrm{hash}(s, r_s)$, 


where $r_s$ is a sufficiently large random number. 

During construction of the signature-based checksum, 
the provenance system uses these hashes instead of the 
actual data items. For example, the name of an unusual 
test performed in $W_i$ can be replaced by a commitment, 
while leaving the other more typical tests in $W_i$ in plaintext. 
When Arizona sends the chain to an internal party 
trusted to view the plaintext version of $U_i$ and/or $W_i$, 
both the commitments and the original plaintext values 
of the unusual test s and $r_s$ will be included as usual in 
the provenance entry. When Arizona sends the chain to 
a lab in England, Arizona can remove the plaintext for s 
and $r_s$ and send only their commitments. Since the chain 
checksums were computed using the commitments rather 
than the plaintext data, the English lab can still verify 
the integrity of the chain. Access to sensitive values is 
prevented until the chain returns to the University of Ari- 
zona, which can reinstate the plaintext in the chain. If the 
English lab chooses to send out the sample for additional 
testing, it may choose to omit all the plaintext from all 
the Arizona entries of the chain. This level of flexibility 
would be awkward to build into the provenance system 
using only session keys, but is easily accomplished with 
commitments. 
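
A hash-based commitment of the form comm(s) = hash(s, $r_s$) can be sketched as follows. The function names, the 32-byte nonce, the use of SHA-256, and the length prefix that separates s from $r_s$ are assumptions of this example rather than details given in the text.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def _digest(value: bytes, nonce: bytes) -> bytes:
    # Length-prefix the value so that (value, nonce) is encoded unambiguously.
    return hashlib.sha256(len(value).to_bytes(4, "big") + value + nonce).digest()


def commit(value: bytes) -> Tuple[bytes, bytes]:
    """Return (comm(s), r_s) for a sensitive subfield s."""
    nonce = secrets.token_bytes(32)  # r_s: a sufficiently large random number
    return _digest(value, nonce), nonce


def opens_to(commitment: bytes, value: bytes, nonce: bytes) -> bool:
    """Check that a revealed (s, r_s) pair matches a stored commitment."""
    return hmac.compare_digest(commitment, _digest(value, nonce))
```

In the Arizona example, only the commitment would travel to the English lab, while the pair (s, $r_s$) stays behind; reinstating the plaintext later amounts to attaching the pair again, and any auditor can confirm it with `opens_to`.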


3.1.5 Augmenting Provenance Chains 


While the integrity checksums of Section 3.1.3 com- 
pletely satisfy the assurances I1-I7, we can introduce fur- 
ther optimizations for faster verification and integrity- 
preserving summarization of long provenance chains. 
Provenance chains tend to grow very fast, often becoming
several orders of magnitude larger than the original
data item and requiring compaction [14]. With our
augmented chains, we can compact the chain by removing
irrelevant entries while preserving the validity of the
integrity verification mechanisms, without recomputing
the signatures.

The integrity spiral, a redundant, multiple-linked 
chain mechanism, is conceptually similar to skip-lists 
[43]. The basic idea is to compute the checksum(s) of 
each provenance entry by combining the (hash of the) 
current entry, and multiple previous checksums. We pro- 
vide two constructions with different properties and us- 
ages. 

Construction 1. The first construction computes the
checksum $C_i$ of the provenance entry $P_i$ as follows:


$$C_i = S_{u_i}\big(\mathrm{hash}(U_i, W_i, \mathrm{hash}(D_i), \mathrm{public}_i, I_i) \,|\, C_{prev_1} \,|\, \cdots \,|\, C_{prev_R}\big),$$


where $R$ is the spiral dimension, and $C_{prev_j}$ represents
a checksum chosen at random from preceding entries in
the provenance chain.

Advantages. This construction allows quick detection
of forgery of entries. Suppose that Mallory modifies the
entry $P_i$ and computes a new checksum $C'$ based on the
forged entry and the preceding part of the chain. In the
singly-linked mechanism, this will be detected when the
auditor checks the checksum for $P_{i+1}$, which Mallory
is unable to forge (per I2). However, the auditor will
have to verify the entire chain up to and including the
entry $P_i$ to detect this. We enable quick local verification
by Construction 1, in which multiple subsequent entries
will be dependent on the checksum $C_i$ of $P_i$. The new
checksum $C'$ added by Mallory will cause the checksums
of these dependent entries to fail, and therefore expose
Mallory’s forgery. To evade detection, Mallory will have
to modify the checksums of all these entries, which in
turn will affect further subsequent entries.
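
The effect described for Construction 1 can be illustrated with a small sketch. It is written under loud assumptions: HMAC-SHA256 stands in for the signature $S_{u_i}$ (HMAC is symmetric and therefore not non-repudiable), each entry is reduced to an opaque payload, and the names `checksum`, `build_chain`, and `failing_entries` are hypothetical.

```python
import hashlib
import hmac
import random


def sign(key: bytes, message: bytes) -> bytes:
    # Stand-in for the signature S_{u_i}; HMAC is symmetric, NOT non-repudiable.
    return hmac.new(key, message, hashlib.sha256).digest()


def checksum(chain, key, payload, prev):
    """Sign hash(entry) concatenated with the checksums at indices `prev`."""
    linked = b"".join(chain[p]["c"] for p in prev)
    return sign(key, hashlib.sha256(payload).digest() + linked)


def build_chain(payloads, keys, R=3, seed=0):
    rng = random.Random(seed)
    chain = []
    for i, payload in enumerate(payloads):
        # Link to up to R preceding entries chosen at random.
        prev = sorted(rng.sample(range(i), min(R, i)))
        chain.append({"payload": payload, "prev": prev,
                      "c": checksum(chain, keys[i], payload, prev)})
    return chain


def failing_entries(chain, keys):
    """Indices whose stored checksum no longer matches a recomputation."""
    return [i for i, e in enumerate(chain)
            if not hmac.compare_digest(
                e["c"], checksum(chain, keys[i], e["payload"], e["prev"]))]


keys = [b"key-%d" % i for i in range(6)]  # one hypothetical key per entry author
chain = build_chain([b"entry-%d" % i for i in range(6)], keys)

# Mallory owns entry 1: she rewrites it and recomputes her own checksum C'.
chain[1]["payload"] = b"forged"
chain[1]["c"] = checksum(chain, keys[1], b"forged", chain[1]["prev"])
bad = failing_entries(chain, keys)
```

Mallory's own entry still verifies, but every later entry that linked to her old checksum now fails. With `R=3`, entries 2 and 3 necessarily link to entry 1, so the forgery is exposed locally without walking the whole chain.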

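Construction 2. In our second construction, we
construct a spiral checksum $C_i$ as follows:


$$C_i = C_{i,1} \,|\, C_{i,2} \,|\, \cdots \,|\, C_{i,R} \,|\, C_{i,0},$$

where $R$ is the spiral dimension, and $C_{i,j}$ is defined as:


$$C_{i,j} = \begin{cases}
S_{u_i}\big(\mathrm{hash}(U_i, W_i, \mathrm{hash}(D_i), \mathrm{public}_i, I_i) \,|\, C_{k,j}\big), & \text{if } j > 0, \\
S_{u_i}\big(\mathrm{hash}(U_i, W_i, \mathrm{hash}(D_i), \mathrm{public}_i, I_i) \,|\, C_{i,1} \,|\, C_{i,2} \,|\, \cdots \,|\, C_{i,R}\big), & \text{if } j = 0,
\end{cases}$$


where $k < i$. Note that, unlike Construction 1, we no
longer have a single checksum per entry; rather we use
more than one independently computed checksum per
entry.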

Advantages. Using this construction, we can perform 
quick verification. The auditor may choose to disregard 
the linear checksum (i.e. the chain that links to the pre- 
vious entry’s checksum) and use any of the other dimen- 
sions. Based on the maximum spiral dimension R of that 
chain, it might reduce the cost of verification of an N 
entry chain to I. If the chain is constructed such that 
all entries belonging to a particular event type are linked 
by a given dimension of the checksum, then the auditor 
can skip irrelevant entries, but still be able to verify the 
integrity of the events she is looking for. 

Rather than choosing previous checksums randomly,
we can use a systematic approach towards building the
spiral. For example, to construct $C_i$, we use checksums
from previous entries at distance $1, 2, \ldots, 2^R$. In
Construction 1, these checksums are all concatenated to the
hash of the current entry, while in Construction 2, these
$R$ checksums are separately concatenated with the hash
of the current entry, and signed to form a collection of $R$
checksums pertaining to the current entry.

Integrity-preserving Summarization. Using Construction 2,
we can compact a provenance chain while still
being able to preserve the integrity verification mechanism.
If $P_j$ occurs after entry $P_i$ in the chain, and $P_j$’s set
of checksums $C_j$ has a checksum $C_{j,k}$ computed from
a checksum $C_{i,k}$ from the set $C_i$ (using Construction 2),
then that checksum can be used to verify the order of
these two entries. If there are $d$ entries between $P_i$ and
$P_j$, we can then remove these entries, while being able to
prove the order. This is not contrary to I1, as the auditor
can use $C_{j,k}$ to verify the order of $P_i$ and $P_j$, while
detecting that $d$ entries have been removed in between them. If
$P_i$ and $P_j$ are not directly connected by a checksum, but
have an intermediate entry $P_m$, such that a checksum
exists in $C_m$ that proves $P_m$ occurs after $P_i$, and a
checksum exists in $C_j$ that proves $P_j$ occurs after $P_m$, then
we can keep $P_m$ and remove all other entries between $P_i$
and $P_j$ during compaction. We can extend this technique
to link any two entries using some intermediate nodes
between them.
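
A rough sketch of Construction 2 and of compaction along one dimension follows. Several choices here are assumptions for illustration only: HMAC-SHA256 with a single demo key stands in for the per-user signatures, entries are opaque payloads, and dimension $j$ links to the entry $2^{j-1}$ positions back (one possible systematic layout).

```python
import hashlib
import hmac

R = 3  # spiral dimension; in this sketch dimension j links 2**(j-1) entries back


def sign(key: bytes, message: bytes) -> bytes:
    # Stand-in for the signature S_{u_i}; HMAC is NOT non-repudiable.
    return hmac.new(key, message, hashlib.sha256).digest()


def build(payloads, key):
    chain = []
    for i, payload in enumerate(payloads):
        h = hashlib.sha256(payload).digest()
        c = {}
        for j in range(1, R + 1):
            k = i - 2 ** (j - 1)                      # index of the linked entry
            prev = chain[k]["c"][j] if k >= 0 else b""
            c[j] = sign(key, h + prev)                # C_{i,j}, j > 0
        c[0] = sign(key, h + b"".join(c[j] for j in range(1, R + 1)))  # C_{i,0}
        chain.append({"payload": payload, "c": c})
    return chain


def verify_link(earlier, later, j, key):
    """Check that `later` was chained onto `earlier` along dimension j."""
    h = hashlib.sha256(later["payload"]).digest()
    c = later["c"]
    ok_dim = hmac.compare_digest(c[j], sign(key, h + earlier["c"][j]))
    ok_all = hmac.compare_digest(
        c[0], sign(key, h + b"".join(c[d] for d in range(1, R + 1))))
    return ok_dim and ok_all


def removed_between(j):
    """Number of entries skipped by one link of dimension j in this layout."""
    return 2 ** (j - 1) - 1


KEY = b"demo-key"
chain = build([b"p%d" % i for i in range(9)], KEY)
compacted = [chain[0], chain[4], chain[8]]  # drop the entries inside each gap
```

After compaction, the order of the surviving entries can still be checked along dimension 3, and the dimension used tells the auditor how many entries were dropped from each gap.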


3.1.6 Chain Operations 


As discussed in Section 2.2, we are concerned with 
tracking data write operations and document metadata 
changes — including changes in permissions and other 
document metadata. We now discuss the impact of doc- 
ument operations on the corresponding secured prove- 
nance chains. Although our discussion refers specifically 
to file system operations, the same semantics are appro- 
priate for other scenarios, including relational databases. 


read No impact on the provenance chain. 

write A new provenance chain entry is created. 

chmod or chown These operations change document 
metadata. For example, chown can be used to add 
a new user to the list of users with write access. 
The change in metadata is recorded as a provenance 
event. 

copy A duplicate copy of the original document is cre- 
ated, with no change in the original document or its 
provenance. The original document’s provenance 
chain is copied into the new document’s provenance 
chain. A new entry is then added to the new chain, 
to record the copy operation itself. 

delete The document will be removed from the file sys- 
tem, but its provenance chain is not deleted. The 
delete operation is recorded as a metadata operation 
in the provenance chain. The chain is kept until it 
expires (determined by its expiration timeout). 
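

The bookkeeping implied by these operations can be sketched as below. The sketch covers only which chain receives an entry; it omits checksums and expiration, and the `Document` class and `apply_operation` function are hypothetical names introduced for the example.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    chain: list = field(default_factory=list)
    deleted: bool = False


def apply_operation(doc, op, user, detail=""):
    """Update provenance for one operation; return the new Document for a copy."""
    if op == "read":
        return None                      # reads leave the chain untouched
    if op in ("write", "chmod", "chown"):
        doc.chain.append({"op": op, "user": user, "detail": detail})
        return None
    if op == "copy":
        # The new document inherits the chain, then records the copy itself.
        clone = Document(name=detail, chain=copy.deepcopy(doc.chain))
        clone.chain.append({"op": "copy", "user": user, "detail": doc.name})
        return clone
    if op == "delete":
        doc.deleted = True               # the document goes away, the chain stays
        doc.chain.append({"op": "delete", "user": user, "detail": detail})
        return None
    raise ValueError(f"unknown operation: {op}")


original = Document("report.txt")
apply_operation(original, "write", "alice", "initial draft")
apply_operation(original, "read", "bob")
apply_operation(original, "chmod", "alice", "0644")
duplicate = apply_operation(original, "copy", "bob", "report-copy.txt")
apply_operation(original, "delete", "alice")
```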


If users were guaranteed not to circumvent the 
provenance-aware read operation (e.g., by reading di- 
rectly from disk), we could support read-related prove- 
nance entries by persisting per-principal provenance con- 
texts containing all information ever read, similar in na- 
ture to propagated access list mechanisms [58]. How- 
ever, as discussed in Section 2.2, we believe that this 
would give the illusion of security but not the reality, be- 
cause principals have access to outside information chan- 
nels. Also, promulgation of provenance for read opera- 
tions tends to result in a combinatorial explosion in over- 
head that can ultimately render the system unusable [38]. 


3.2 Correctness 


The mechanisms introduced above satisfy the integrity, 
confidentiality and privacy properties outlined in Sec- 
tion 2. 


Theorem 1. An adversary cannot remove entries from the 
beginning or the middle of the chain without detection (I1). An 
adversary cannot add entries in the beginning or the middle of 
the chain without detection (I2).


Proof. (sketch) The proof is straightforward. For (I1),
let us assume that an adversary Mallory has removed the
entry $P_i$, $0 < i < n$. Since the integrity checksum field
$C_{i+1}$ of the subsequent entry is computed by combining
the current checksum $C_i$ with $W_{i+1}$ under an ideal
cryptographic hash function, its verification will fail,
therefore revealing the removal of $P_i$. Similarly, for (I2), any
addition of chain entries will be detected in the
verification step through the checksum components.
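
The argument can be checked on a toy singly-linked chain. The sketch below assumes HMAC-SHA256 with one demo key as a stand-in for the signatures of Section 3.1.3, and treats each entry as an opaque payload; the function names are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    # Stand-in for the signature; HMAC is symmetric, NOT non-repudiable.
    return hmac.new(key, message, hashlib.sha256).digest()


def build_chain(payloads, key):
    chain, prev = [], b""
    for payload in payloads:
        prev = sign(key, hashlib.sha256(payload).digest() + prev)
        chain.append((payload, prev))
    return chain


def first_bad_index(chain, key):
    """Return the index of the first entry whose checksum fails, or None."""
    prev = b""
    for i, (payload, stored) in enumerate(chain):
        expected = sign(key, hashlib.sha256(payload).digest() + prev)
        if not hmac.compare_digest(stored, expected):
            return i
        prev = stored
    return None


KEY = b"demo-key"
chain = build_chain([b"a", b"b", b"c", b"d", b"e"], KEY)

# Removal from the middle: drop the entry at index 2.
removed = chain[:2] + chain[3:]

# Insertion in the middle: a forged entry correctly chained to its predecessor.
forged = (b"x", sign(KEY, hashlib.sha256(b"x").digest() + chain[1][1]))
inserted = chain[:2] + [forged] + chain[2:]
```

In both cases the first failure appears at the entry that follows the tampered position, because its stored checksum was computed over a predecessor checksum that is no longer there.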














Theorem 2. Once the chain contains any subsequent entries 
by non-malicious users, a set of colluding adversaries cannot 
insert or remove entries between them in the chain (I3, I4). 


Proof. (sketch) The chained nature of the integrity 
checksum directly ensures this. Specifically, suppose 
that Eve and Mallory are two colluding adversaries who 
are part of the chain, with entry $P_e$ followed by $P_m$ later
on in the chain. Moreover, let $P_a$ be non-colluding
Alice’s entry following Eve’s and Mallory’s. The chain will
thus be $(\ldots, P_e, \ldots, P_m, \ldots, P_a, \ldots)$. Due to the
collision-free, one-way nature of the chained integrity checksum
fields, any modification in the chain entries between $P_e$
and $P_m$ will naturally show up when attempting to verify
$P_a$’s checksum.














As discussed in Section 2, if $P_m$ is the last element
in the chain, Mallory can always remove all entries
between $P_m$ and any previous entry by a colluding party,
e.g., $P_e$. This DoS attack cannot be prevented through
technical means alone.


Theorem 3. When checksums are constructed using the formula
from Section 3.1.3, users cannot repudiate an entry (I5).


Proof. (sketch) This follows by construction when
integrity checksum $C_i$ is implemented using the
non-repudiable signatures of Section 3.1.3.














Theorem 4. An adversary cannot successfully claim that a 
valid provenance chain for a given document belongs to a doc- 
ument with different contents (I6). 


Proof. (sketch) This follows directly from the collision- 
free nature of hashing and the fact that a hash of the cur- 
rent document contents is included in each chain entry, 
which is then authenticated using the chained $C_i$
checksums. Substituting the chain for a different document
will be detected by a superauditor when a checksum
fails to verify.














Theorem 5. If a document’s contents are inconsistent with
its history as recorded in a provenance chain with a reversible
or plaintext representation for $w_i$ fields, then any superauditor
can detect the discrepancy (I7).


Proof. By definition, a superauditor can decrypt all
details of the $w_i$ field in each entry, if the $w_i$ fields are
encrypted with session keys. Otherwise, the $w_i$ fields
are available in plaintext. After verifying the chain, the
superauditor can apply the $w_i$ entries in reverse,
repeatedly verifying that the hash of $D_i$ included in entry $P_i$
matches the hash of its recreated contents. At the last
verification, $D_0$ should be the empty document.
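
The reverse replay in this proof can be sketched as follows. The record layout (offset, overwritten bytes, new bytes, hash of the resulting document) is an assumed reversible representation chosen for the example, not necessarily the one used by the authors, and the sketch presumes the chain checksums have already been verified.

```python
import hashlib


def apply_write(doc: bytes, offset: int, new: bytes):
    """Overwrite or extend doc at offset; return (new_doc, reversible record)."""
    if offset > len(doc):
        raise ValueError("this sketch does not support writes past the end")
    old = doc[offset:offset + len(new)]
    updated = doc[:offset] + new + doc[offset + len(new):]
    record = {"offset": offset, "old": old, "new": new,
              "doc_hash": hashlib.sha256(updated).hexdigest()}
    return updated, record


def consistent(doc: bytes, records) -> bool:
    """Undo records newest-first, checking the recorded hash of each D_i."""
    for rec in reversed(records):
        if hashlib.sha256(doc).hexdigest() != rec["doc_hash"]:
            return False
        end = rec["offset"] + len(rec["new"])
        if doc[rec["offset"]:end] != rec["new"]:
            return False
        doc = doc[:rec["offset"]] + rec["old"] + doc[end:]
    return doc == b""  # D_0 should be the empty document


doc, records = b"", []
for offset, data in [(0, b"hello world"), (6, b"there"), (11, b"!")]:
    doc, rec = apply_write(doc, offset, data)
    records.append(rec)
```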














If a non-reversible representation is used for the $w_i$
entries, or the auditor is not a superauditor, the auditor may
still be able to tell that the chain is inconsistent with the 
current contents of D. E.g., if a chain entry says that all 
document appendices were deleted, and no subsequent 
entry added any appendices, then application-domain- 
dependent reasoning lets an auditor conclude that some- 
thing is wrong if the document has appendices. 


Theorem 6. Any auditor can verify the chain (C1). Auditors 
can only decode entry details for which they are authorized. 


Proof. (sketch) This also follows directly by construc- 
tion. When deploying non-repudiable signature-based 
checksums as in Section 3.1.3, chain verification in- 
volves only public key signature operations and no other 
secret values. It can thus be performed by any party. 
Now consider the question of whether unauthorized
parties can access the details of entries. We argue the
case where a single session key $k_i$ is used to encrypt all
sensitive details in the entry, and the key itself is
protected using broadcast encryption; the argument is
similar if multiple session keys or a single shared key are
used for this purpose. First, a session key $k_i$ for $W_i$ is
accessible only to principals that can retrieve it by
decrypting at least one item in set $I_i$. If a principal can
decrypt one of these items, then it possesses a private key
in the broadcast encryption tree, and must therefore be
an auditor represented by a leaf in the subtree rooted at
the private key in question. Thus the principal is an
auditor who should be allowed to obtain the session key, and
therefore should be allowed to see all data encrypted with
it, including $W_i$ and (if encrypted) $U_i$ and $\mathrm{public}_i$.














Finally, we discuss chain verifiability when crypto- 
graphic commitments are used for potentially sensitive 
plaintext, and the plaintext is subsequently removed. 


Theorem 7. The use of cryptographic commitments in place 
of potentially sensitive plaintext subfields of $U_i$ or $W_i$ does not
affect the verifiability of the chain. 


Proof. (sketch) The checksum component of the entry 
is computed by using cryptographic commitments rather 
than the sensitive data item’s plaintext. Hence, if the 
plaintext is removed when releasing to an untrusted prin- 
cipal, the chain remains verifiable, as the verification 
mechanism only requires the commitment. The chain in- 
tegrity is also not compromised, due to unforgeability of 
signatures on the checksum entries for other users. 


4 Empirical Evaluation


Several avenues are available for implementing secure 
provenance functionality: in the operating system kernel, 
at the file system layer, or in the application realm. 

Kernel Layer. In this implementation approach,
provenance record functions are handled by trapping
kernel system calls, similar to the approach taken in the
Provenance-aware Storage System (PASS) [38]. The
main advantage of this approach is its transparency to
user level applications and the file system layer. Major
drawbacks include the fact that the logic and higher level
data management semantics are not naturally propagated
to the kernel, thus limiting the types of
provenance-related inferences that can be made. Yet another
drawback of such an approach is its limited portability, as any
new deployment platform will require porting efforts.

File System Layer. The file system can be made
provenance-aware and augmented to transparently
handle securing collected provenance information. Similarly
to the kernel layer implementation, one of the main
advantages of such an approach is transparency. However,
persisting provenance state transparently inside the file
system layer will reduce the portability of the provenance
assurances, e.g., when provenance-augmented files
traverse non-compliant environments.

Application Layer. In this approach, the provenance 
mechanisms are offered through user-level libraries. This 
can still maintain the transparency of the previous ap- 
proaches while also allowing for a high degree of porta- 
bility, i.e., by being independent of kernel and file sys- 
tem layer instances. The provenance libraries can be lay- 
ered on top of any file system, making rapid prototyping 
and deployment very easy. Moreover, through dynamic 
linking and by maintaining a compatible interface, exist- 
ing user applications do not need to be recompiled for 
provenance-awareness. 


4.1 The Sprov Library 


We implemented a prototype of the secure provenance 
primitives as an application layer C library, consisting 
of wrapper functions for the standard file I/O library 
stdio.h. The resulting library is fully compatible with 
stdio functionality, in addition to transparently handling 
provenance assurances. We used the basic model
introduced in Sections 3.1.2 and 3.1.3 in this prototype.

In Sprov, a session is defined as all the operations per- 
formed by a user on a file between file open and close. 
When a file is opened in write or append mode, Sprov 
initiates a new entry in the provenance chain of the file. 
Information about the user, application, and environment
is collected. During write operations, Sprov gathers
information about the writes to the file before it is closed.
Sprov uses a reversible representation of document
modifications; as discussed earlier, this allows strong
verification of the relationship between current document
contents and the document’s provenance chain. The
provenance chain can be used as a rollback log, which can
form the basis of a versioning file system; we leave this
for future work.


At file close, the session ends. Sprov writes a new en- 
try in the provenance chain for the changes made dur- 
ing this session. At this point, the cryptography (imple- 
mented using openssl [54]) associated with the chain in- 
tegrity constructs is executed, as described in Section 3.1. 
The provenance chain is stored in a separate meta-file for 
portability. 
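
The session model can be illustrated with a small wrapper. This is only a sketch of the open-to-close lifecycle, not the Sprov C library: the class name `ProvenanceFile`, the `.prov` meta-file suffix, the JSON entry layout, and the use of HMAC-SHA256 in place of the signatures are all assumptions made for the example.

```python
import hashlib
import hmac
import json
import os
import tempfile


class ProvenanceFile:
    """Hypothetical wrapper: one provenance entry per open-to-close session."""

    def __init__(self, path, user, key, mode="ab"):
        self.path, self.user, self.key = path, user, key
        self.meta_path = path + ".prov"   # assumed name of the separate meta-file
        self.writes = []
        self.fh = open(path, mode)        # the session starts at file open

    def write(self, data: bytes):
        self.writes.append(data.hex())    # gather information about each write
        return self.fh.write(data)

    def _last_checksum(self) -> str:
        if not os.path.exists(self.meta_path):
            return ""
        with open(self.meta_path) as meta:
            lines = meta.read().splitlines()
        return json.loads(lines[-1])["checksum"] if lines else ""

    def close(self):
        self.fh.close()                   # the session ends at file close
        with open(self.path, "rb") as f:
            doc_hash = hashlib.sha256(f.read()).hexdigest()
        body = json.dumps({"user": self.user, "writes": self.writes,
                           "doc_hash": doc_hash}, sort_keys=True)
        prev = self._last_checksum()
        # HMAC stands in for the signature; it is NOT non-repudiable.
        checksum = hmac.new(self.key, (body + prev).encode(),
                            hashlib.sha256).hexdigest()
        with open(self.meta_path, "a") as meta:
            meta.write(json.dumps({"body": body, "checksum": checksum}) + "\n")


def demo():
    """Run two sessions by two hypothetical users and return the chain entries."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "report.txt")
        for user, text in [("alice", b"draft\n"), ("bob", b"review\n")]:
            f = ProvenanceFile(path, user, key=user.encode())
            f.write(text)
            f.close()
        with open(path + ".prov") as meta:
            return [json.loads(line) for line in meta]


entries = demo()
```

Each session appends exactly one entry to the meta-file, and each entry's checksum covers the previous entry's checksum, mirroring the chaining of Section 3.1.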


We provide utilities to facilitate provenance collection 
and transfer. When a user logs into her system, the plogin 
utility is invoked, which initializes the session keys and 
loads the user’s preferences and list of trusted auditors. 
Copying and deletion of a provenance-enabled file uses 
the pcopy and pdelete utilities, respectively, as discussed 
in Section 3.1.6. Finally, the pmonitor daemon periodi- 
cally scans for and removes expired provenance chains. 


4.2 Experiments 


Our experiments employed x86 Pentium 4 3.4GHz
hardware with 2GB of RAM, running Linux (SUSE) at
kernel version 2.6.11. In this configuration, each 1024-bit
DSA signature took 1.5ms to compute. The experiments 
used a mix of four of the following drive types: Seagate 
Barracuda 7200.11 SATA 3Gb/s 1TB, 7200 RPM, 105 
MB/s sustained data rate, 4.16ms average seek latency 
and 32 MB cache, and Western Digital Caviar SE16 3 
Gb/s, 320GB, 7200 RPM, 122 MB/s sustained data rate, 
4.2ms average latency and 16MB cache. 


We conducted our experiments using multiple bench- 
marks in a quest to match several different deployment 
settings of relevance. In each case, we compared the 
execution times for the baseline unmodified benchmark 
(with no provenance collection at all), with a run with 
secure provenance enabled. We deployed (i) PostMark 
[26] — a standard benchmark for file system performance 
evaluation, (ii) the Small and Large file microbenchmark 
that has been used to evaluate the performance of PASS 
[38, 48], and (iii) a custom transaction-level benchmark 
meant to test the performance in live file systems with 
file sizes distributed realistically [2, 16], and real-life file 
system workloads [17, 28, 44]. 


We also evaluated two different configurations for 
storing the provenance chain. In the first configuration, 
provenance chains were recorded on disk (Config-Disk), 
while in the second one, provenance chains were stored
in a RAM disk, with a pmonitor cron daemon
periodically flushing the chain to disk (Config-RD).


Figure 1: Overhead of secure provenance with Postmark. Overhead1
indicates the overhead using Config-Disk, while Overhead2 is from the
Config-RD setting. The overheads are shown from 0% read bias (100%
write transactions) to 100% read bias (no write transactions).


4.2.1 Postmark Benchmark 


We measured the execution time of the Postmark bench- 
mark [26] with and without the Sprov library. A data set 
containing 20,000 Postmark-generated binary files with 
sizes ranging from 8KB to 64KB was subjected to
Postmark workloads of 20,000 transactions. Each transaction
set was a mixture of writes and reads of sizes varying
between 8KB and 64KB. We sampled the performance
overhead under different write loads by varying the
read-write bias from 0% to 100% in 10% increments (i.e., the
percentage of write transactions was varied from 100% to
0%). The overheads are illustrated in Figure 1 for both
Config-Disk and Config-RD and range from 0.5% to 11%
for Config-RD.


4.2.2 Small and Large File Microbenchmarks 


The small and large file microbenchmarks [48] have been 
used in the evaluation of PASS [38]. The small file
microbenchmark creates, writes to, and then deletes 2500
files of sizes 4KB, 8KB, 16KB, and 32KB.
We benchmarked the overhead for file creation as well 
as synchronous writes. The results for Config-Disk are 
displayed in Figure 2. 

An interesting effect can be observed. Similar to the
experiments in PASS, the overhead percentage is quite
high for small files and decreases rapidly with increasing
file sizes. We believe this effect can be attributed to disk
caching. Specifically, for very small file accesses, which
go straight to the disk cache, the main overhead
culprit (crypto signatures) dominates. As file sizes
increase, additional real disk seeks are incurred in both
cases and start to even out the execution times.
Eventually, the overhead stabilizes to under 50% for larger
files, which suggests that roughly 1-3 seek times are paid
per file and the secure case adds the equivalent of another
seek time (the crypto signature).


[Figure 2 plot: Overhead (Percentage) versus File Size (KB); series: Create, Write-seq.]

Figure 2: Small file system microbenchmark create and write
performance for 2500 files.

              no prov    Sprov (CD)   Overhead   Sprov (CRD)   Overhead
Seq-write     13.084     13.328       1.87%      13.308        1.71%
Rand-write    15.211     15.390       1.18%      15.285        0.48%


Table 1: Overhead (in seconds) for large file microbenchmark,
under Config-Disk (CD) and Config-RD (CRD).
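

The percentages in Table 1 follow from the reported times as the relative slowdown of the provenance-enabled run. The short check below recomputes them; the function name `overhead_percent` is introduced here for illustration.

```python
def overhead_percent(baseline_s: float, with_provenance_s: float) -> float:
    """Relative slowdown of the provenance-enabled run, in percent."""
    return 100.0 * (with_provenance_s - baseline_s) / baseline_s


# Times in seconds from Table 1: (no provenance, Config-Disk, Config-RD).
table1 = {
    "Seq-write": (13.084, 13.328, 13.308),
    "Rand-write": (15.211, 15.390, 15.285),
}

overheads = {
    name: (overhead_percent(base, cd), overhead_percent(base, crd))
    for name, (base, cd, crd) in table1.items()
}
```

The recomputed values agree with the table to within about 0.01 percentage points.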


The small file microbenchmark only measures the ef- 
fect of writes to many small files [38]. Often, writes 
to large files can provide more representative estimates 
of typical overheads in file systems. Thus, next we de- 
ployed the Large file benchmark as described in PASS. 
We performed the sequential-write and random-write 
operations of the benchmark. Both unmodified and 
provenance-enhanced versions of the benchmark were 
run, and this time, the disk write-caches were turned off 
to eliminate unwanted disk-specific caching effects.

The benchmark consists of creating a 100 MB file by 
writing to it sequentially in 256KB chunks, followed by 
writing 100MB of data in 256KB units written in random
locations of the file. The overheads for sequential and 
random writes are presented in Table 1. In both cases 
(Config-Disk and Config-RD), the overheads are consid- 
erably lower than the overheads reported in [38], despite 
the additional costs of recording all the file writes to its 
provenance chain. 


4.2.3 Hybrid Workload Benchmark


Benchmarks like Postmark are useful due to their
standardized nature and ability to replicate the results.
Additionally, we decided to evaluate our overheads in a more
realistic scenario, involving practical, documented
workloads and file system layouts. We constructed a layout
as discussed by Douceur et al. [16], which showed that
file sizes can be modeled using a log-normal distribution.
We used the parameters $\mu = 8.46$, $\sigma = 2.4$ to
generate a distribution of 20,000 files, with a median file size
of 4KB, and mean file size of 80KB, along with a small
number of files with sizes exceeding 1GB to account for
large data objects, as suggested in [2, 16].


Figure 3: Overhead for different number of write transactions and data
sizes, at 100% write load (Config-Disk).
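

A log-normal file-size layout of this kind can be generated as sketched below. The sketch assumes the two parameters are on the natural-logarithm scale, which is consistent with the stated median of roughly 4KB and mean of roughly 80KB; the seed, the rounding to whole bytes, and the function name are illustrative choices, and the handful of files over 1GB mentioned in the text is not modeled.

```python
import math
import random
import statistics

MU, SIGMA = 8.46, 2.4   # parameters from the text, assumed natural-log scale
NUM_FILES = 20_000


def sample_file_sizes(n=NUM_FILES, seed=0):
    """Draw n file sizes in bytes from a log-normal(MU, SIGMA) distribution."""
    rng = random.Random(seed)
    return [max(1, round(rng.lognormvariate(MU, SIGMA))) for _ in range(n)]


sizes = sample_file_sizes()
median_size = statistics.median(sizes)

# Closed-form values for a log-normal distribution.
theoretical_median = math.exp(MU)                  # a little under 5,000 bytes
theoretical_mean = math.exp(MU + SIGMA ** 2 / 2)   # between 80,000 and 90,000 bytes
```

The sample mean of a distribution this heavy-tailed varies noticeably between runs, so the closed-form mean is the more reliable point of comparison with the 80KB figure.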

Our first workload on this dataset involved a fixed
number of write transactions. Under the Config-Disk setting,
we performed 25K, 50K, 80K, and 100K write
transactions. Between experimental runs, we recreated the
dataset, cold-booted the system, and flushed file system
buffers to avoid variations caused by OS or disk caching.
In each transaction, a file was opened at random, and a
fixed amount of data (1KB or 4KB) was written into it.
We measured the overhead for both appends and random
writes. These are shown in Figure 3. Constant overheads
can be observed for each of the 4 configurations, with
append situated between 32% and 42%, and random writes
between 26% and 33%.

Next, we modeled the percentage of write to read
transactions according to the data in [17, 28, 44], which
suggest this varies from 1.1% to 82.3%. To this end,
we deployed information about workload behavior and
used parameters for the instructional (INS), research
(RES) [44], a campus home directory (EECS) [17], and
CIFS corporate and engineering workloads (CIFS-corp,
CIFS-eng) [28]. The RES and INS workloads are
read-intensive, with the percentage of write transactions less
than 10%. The CIFS workloads are less read-intensive,
with the read-write ratio being 2:1. The EECS
workload has the highest write load, with more than 80% write
transactions. The results are shown in Figures 4 and 5, for
both the disk-based (Config-Disk) and the RAM-disk
optimized (Config-RD) modes.


Figure 4: Overhead for various types of workloads (Config-Disk).

Figure 5: Overhead for various types of workloads (Config-RD).


Read-intensive workloads can be seen as almost over- 
head free, with less than 5% overheads for both RES and 
INS (this goes down to less than 2% for the RAMdisk- 
optimized mode). For write intensive workloads, the 
overheads are higher, but still less than 14% for the 
CIFS workloads (Config-Disk), and less than 36% for 
EECS (Config-Disk). With the RAMdisk-optimization, 
the overheads go down to less than 3% for CIFS and 
around 6.5% for EECS. 


Summary. Sprov facilitates collection of provenance 
with integrity and confidentiality assurances, while in- 
curring minimal overhead. Read performance is unaf- 
fected by the use of Sprov. Benchmarks show that, with 
the Config-RD setting, use of Sprov incurs an overhead 
less than 3% in a multitude of realistic workloads. 


5 Related Work


Researchers have categorized provenance systems for 
science [50] and investigated the question of how to 
capture provenance information, typically through in- 
strumenting workflows and recording their provenance 
[3, 4, 7, 37, 52, 53]. Other provenance management 
systems used in scientific computing include Chimera 
[18] for physics and astronomy, myGrid [60] for biol- 
ogy, CMCS [39] for chemistry, and ESSW [19] for earth 
science. 

Another technique is to collect provenance informa- 
tion at the operating system layer, with the advantage of 
being hard to circumvent and the disadvantages of be- 
ing expensive and hard to deploy. The Provenance-aware 
Storage System (PASS) [8, 38] takes this approach using 
a modified Linux Kernel. While PASS does not actu- 
ally record the data written to files, it collects elaborate 
information flow and workflow descriptions at the OS 
level. Our techniques of securing provenance chains can 
be used to augment PASS or any such system to provide 
the security assurances at minimal cost. 

The database community has explored a variety of 
aspects of provenance, including the notions of why- 
provenance and where-provenance and how to support 
provenance in database records and streams (e.g., [9, 10, 
11, 12, 57, 59]). Others have examined the applications 
of provenance to social networks [21] and information 
retrieval [31]. 

Overall, the body of research on provenance has fo- 
cused on the collection, semantic analysis, and dissem- 
ination of provenance information, and little has been 
done to secure that information [8, 25]. One exception is 
the Lineage File System [46], which automatically col- 
lects provenance at the file system level. It supports ac- 
cess control in the sense that a user can set lineage meta- 
data access flags, and the owner of a file can read all of 
its lineage information. However, this does not meet the 
challenges (I1-I7,C1-2) for confidentiality, integrity and 
privacy of provenance information outlined in [8, 25] and 
discussed in this paper. 

Outside the domain of provenance, researchers have 
used entanglement — mechanisms of preserving the his- 
toric states of distributed systems in a non-repudiable, 
tamper-evident manner [32, 45]. This provides similar 
assurances to the ones sought here for the realm of sys- 
tems, yet does not handle provenance for information 
flows and individual data records. 

Source code management systems (SCM) target the 
provenance needs of a particular application domain. For 
example, Subversion [15], GIT [30], or CVS [5] with
secure audit trails can provide integrity assurances for
versions in a centralized file system. GIT, Monotone [1],
and several other systems also provide support for a
distributed infrastructure. These systems employ a logically
centralized model where users maintain local histories
and use a virtual (centralized) repository to merge and
synchronize their local repositories. Our approach is
intended for applications with a more fully decentralized
model, where documents and their histories are
physically passed between users in separate administrative
domains that may not trust one another. In addition, as our
approach is intended to meet the needs of many
potential applications, we have worked to provide much higher
performance than an SCM system requires.

Verifiable audit trails for versioning file systems can 
use keyed hash-chains to protect version history [42]. 
Under this approach, auditors are fully trusted and share 
a symmetric key with the file system for creating the 
MACs. The audit authenticators need to be published to 
a trusted third party, which must provide them accurately 
during audits. Our approach must also handle malicious 
auditors who could easily falsify the audit. 

Similarly to audit trails, secure audit logs based on 
hash chains have been used in computer forensics [47, 
51]. Such schemes work under different system and 
threat models than secure provenance. By their very na- 
ture, audit logs are stationary and protect the integrity 
of local state. In contrast, provenance information is 
highly mobile and often traverses multiple un-trusted do- 
mains. Moreover, audit logs rarely require the selective 
confidentiality assurances needed for provenance. For 
example, the mechanisms proposed in [47] secure logs 
as a whole, but do not allow authentication of individual 
modifications. Additionally, provenance is usually asso- 
ciated with a digital object (e.g. file). This association 
introduces attacks that are not applicable to secure au- 
dit logs. Finally, a majority of schemes function under 
the assumption of single (or very few) parties process- 
ing the audit log and computing checksums — a different 
model from the case of provenance chains where multi- 
ple principals’ access is required throughout the lifetime 
of a provenance chain. 

Secure Untrusted Data Repository (SUNDR) [29]
provides a notion of consistency for shared memory (called
fork consistency) akin to the integrity property provided 
for provenance records in our systems. While our tech- 
niques for ensuring chain integrity are related to those 
used in SUNDR, the adversarial model of SUNDR is 
different from ours. In SUNDR, a set of trusted clients 
communicate with an untrusted server storing a shared 
filesystem. In contrast, our system does not employ a 
central server, and allows any number of users to be cor- 
rupted. Moreover, SUNDR does not address confiden- 
tiality issues. 

Multiply-linked hash chains have been used for
signature cost amortization in multicast source authentication
[20, 23, 35, 41]. Our spiral chain constructs are similar in
principle. One main difference, however, is that such hash
chains all assume a single sender signing the message
block containing the hashes. We can adopt these methods
to amortize signature costs in consecutive provenance
chain entries from the same principal, but with
multiple principals, we need chaining using non-repudiable
signatures. Also, many of the hash-chain schemes
require the entire stream to be known a priori, an
assumption not applicable to provenance deployment settings.
Finally, the second spiral construction allows
integrity-preserving compaction, which is not possible with the
hash chains.

Integrity of cooperative XML updates has been dis- 
cussed in [33], where document originators define a flow 
path policy before dissemination and recipients can ver- 
ify whether the document updates happened according 
to this flow policy. In contrast, for flexibility and wider
applicability, our model does not require pre-defined
flow path policies in order to provide the integrity
assurances described in Section 2.


6 Conclusion 


In this paper, we introduced a cross-platform, low- 
overhead architecture for capturing provenance informa- 
tion at the application layer. Our approach provides fine- 
grained control over the visibility of provenance infor- 
mation and ensures that no one can add or remove entries 
in the middle of a provenance chain without detection. 
We implemented our approach for tracking the prove- 
nance of data writes, in the form of a library that can be 
linked with any application. Experimental results show 
that our approach imposes overheads of only 1-13% on 
typical real-life workloads. 
Abstract


Versioning file systems provide the ability to recover 
from a variety of failures, including file corruption, virus 
and worm infestations, and user mistakes. However, us- 
ing versions to recover from data-corrupting events re- 
quires a human to determine precisely which files and 
versions to restore. We can create more meaningful ver- 
sions and enhance the value of those versions by captur- 
ing the causal connections among files, facilitating se- 
lection and recovery of precisely the right versions after 
data corrupting events. 

We determine when to create new versions of files au- 
tomatically using the causal relationships among files. 
The literature on versioning file systems usually ex- 
amines two extremes of possible version-creation algo- 
rithms: open-to-close versioning and versioning on ev- 
ery write. We evaluate causal versions of these two algo- 
rithms and introduce two additional causality-based al- 
gorithms: Cycle-Avoidance and Graph-Finesse. 

We show that capturing and maintaining causal rela- 
tionships imposes less than 7% overhead on a versioning 
system, providing benefit at low cost. We then show that 
Cycle-Avoidance provides more meaningful versions of 
files created during concurrent program execution, with 
overhead comparable to open/close versioning. Graph- 
Finesse provides even greater control, frequently at com- 
parable overhead, but sometimes at unacceptable over- 
head. Versioning on every write is an interesting extreme 
case, but is far too costly to be useful in practice. 


1 Introduction 


Versioning file systems automatically create copies (i.e., 
versions) of files as they are modified, providing numer- 
ous benefits to users and administrators. Users find ver- 
sions convenient when they inadvertently remove or cor- 
rupt a valuable file. Administrators find that versioning 
systems greatly reduce the rate of requests to restore files
from backup. In addition, versioning file systems provide
the means to clean up after a data-corrupting intrusion. 
Unfortunately, versioning alone does not help in identi- 
fying the most recent “good” version of a file or how data 
corruption may have spread from one file to another. 

Snapshotting, often implemented using checkpoints, 
is another approach for versioning that is common for 
backup systems. Such a system periodically takes a 
whole or incremental image of the file system and then 
uses copy-on-write for data modified after the snapshot. 
Snapshots, similar to versioning file systems, cannot help 
identify the most recent “good” version of a file. Another 
drawback of snapshot-based systems is the granularity of 
recovery: it is not possible to undo changes made be- 
tween snapshots. 

Several new file system designs capture causality rela- 
tionships among files for a variety of different purposes. 
For example, Taser [7] captures causality information to 
address the challenge of identifying files tainted by an 
intrusion or corrupted by administrative errors. Back- 
Tracker [11] captures causality information to analyze 
intrusions. Provenance-aware storage systems (PASS) 
capture the provenance or digital history of files to let 
users answer questions such as, “How do these two files 
differ?” “What files are derived from this one?” “From
what files is this file derived?” “How are these two files 
related?” [14]. Other systems [19] preserve causal rela- 
tionships to enhance personal search capabilities. 

Combining versioning and the capture of causal rela- 
tionships introduces functionality not available in exist- 
ing systems. For example, suppose a system has been 
compromised by a data-corrupting worm. Upon iden- 
tifying a tainted file, the causal relationships provide a 
mechanism to trace backwards to find the last version 
prior to the corruption and then trace forward to identify 
all the files tainted by that corruption. These two traces 
precisely identify the appropriate files and versions that 
need to be restored to recover from the intrusion. Without
versioning, an administrator’s only recourse is to restore
the system to a clean snapshot, potentially losing
valuable user data. Without causal relationships, the ad- 
ministrator cannot know how the corruption has spread. 
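The forward trace described above can be sketched as a graph traversal. This is a minimal illustration, not PASS's recovery tooling: it assumes the causal records are available as (dependent, ancestor) pairs, and the object names and version labels are hypothetical.

```python
from collections import defaultdict, deque

def tainted_by(records, start):
    """Return every object version that transitively depends on `start`.

    `records` is an iterable of (dependent, ancestor) pairs; each pair
    says that `dependent` causally depends on `ancestor`.
    """
    dependents = defaultdict(set)
    for dependent, ancestor in records:
        dependents[ancestor].add(dependent)

    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in dependents[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Hypothetical records: a worm process wrote report.v2, which a backup
# process later read before writing archive.v5.
records = [
    ("report.v2", "worm"),
    ("backup", "report.v2"),
    ("archive.v5", "backup"),
    ("notes.v1", "editor"),
]
tainted = tainted_by(records, "worm")
```

The backward trace is the same traversal with the edge direction reversed; the versions just before the first tainted ones are the candidates for restoration.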

Similarly, imagine a scenario where a physics simula- 
tor produces results today that differ from those produced 
yesterday. Here, causal data can reveal the cause of the 
difference, while versioning data can recover to the ear- 
lier (and presumably correct) version. 

Conventional versioning file systems [3, 10, 15, 18, 
24] typically use one of two techniques to determine 
when to create new file versions: “open-close” and 
“version-on-every-write”. In the “open-close” approach,
versions are defined relative to open and close events. 
Typically a new version is created upon the first block 
update after an open and all writes that occur before the 
final close operation appear in that new version. This 
has the potential to lose valuable information. For ex- 
ample, consider the split-logfile vulnerability in Apache 
1.3 [25]. The vulnerability, present in a helper program
called split-logfile, allows any file in the system
with a .log file extension to be written. Assume
that a database server running on the same machine as
the Apache split-logfile helper uses a file called
db.log to store its recovery information. This database
server opens the file when it is started and keeps it open. 
The first time the database server writes to the log file, an 
“open-close” system will create a new version of it. That 
version will remain the current version until the database 
is shut down. Now, suppose an attacker exploits Apache 
and writes new data in db.log. At this point, the log
consists of some old “good” log entries and some new
“bad” log entries. Even if an administrator finds that
db.log has been corrupted, the only version available
for recovery is the one that existed before the database 
server wrote anything. If the administrator restores that 
version, all database operations since the server started 
will be lost. One might turn to the “version-on-every- 
write” algorithm, which creates a new version each time 
data is written to the file; this approach ensures that no 
data is lost, but it can be expensive in both time and 
space. 

Fortunately, versioning algorithms informed by 
causality relationships produce versions that facilitate re- 
covery to the pre-tampering state without the overhead of 
versioning on every write. In the Apache example above, 
causality-based techniques force a new version of the log 
file to be started before the attacker’s writes are applied. 
We introduce two such causality techniques: Cycle- 
Avoidance and Graph-Finesse. Cycle-Avoidance con- 
servatively declares new versions using knowledge local 
to the objects being acted upon (i.e., processes, files, pipes,
etc.). Graph-Finesse is less conservative, using global
knowledge to maintain an in-memory graph of dependencies
between objects, declaring a new version whenever
adding a dependency edge introduces a cycle. We
discuss these algorithms in more detail in Section 4. 

As we show in Section 6, any kind of versioning, 
when coupled with maintenance of causal relationships, 
provides significant value. In the presence of long- 
running and/or concurrent execution, Cycle-Avoidance 
creates the versions necessary to recover from corrup- 
tion without introducing significant overhead above that 
of open/close. Graph-Finesse creates slightly fewer ver- 
sions than Cycle-Avoidance, but it pays significant over- 
head in workloads that read and write a large number of 
files. Versioning on every write exhibits sufficiently high 
overhead that it is impractical. 

The contributions of the paper are as follows: 


• New functionality arising from the integration of
versioning with causal data,


• New causality-based techniques for versioning,


• A prototype embodying versioning and causal relationships,
and


• An evaluation of four causality-based versioning
algorithms.


The rest of the paper is organized as follows. In Sec- 
tion 2, we introduce the system upon which we build our 
causality-based versioning system and describe its essen- 
tial architectural details. In Section 3, we discuss several 
novel use cases that causality-based versioning enables. 
In Section 4, we present details of the new versioning 
algorithms. In Section 5, we discuss the versioning file 
system implementation. In Section 6, we present eval- 
uation results. In Section 7, we discuss related work. 
Finally, we conclude in Section 8. 


2 Causal Relationships with PASS 


We extended PASS [14] (Provenance-aware storage sys- 
tem), the causality-collection system we built, to capture 
versioning information. We chose PASS as it captured 
precisely the data that we needed and has a modular ar- 
chitecture that made it easy to add versioning. In this 
section, we provide a high level overview of the PASS 
architecture to provide the necessary background to un- 
derstand our version creation algorithms and implemen- 
tation. 

Figure 1 shows the PASS architecture. Its key com- 
ponents are: 


• Interceptor: The interceptor intercepts system
calls, passing information to the observer, described
next. The interceptor is a thin layer that is operating
system specific, while the remaining components
can be operating system independent.


• Observer: The observer translates system call
events to provenance records. For example, when
a process P reads a file A, the observer generates
a record $P \rightarrow A$, to indicate that the process P
depends on the file A. It is precisely these provenance
events that capture the causal dependencies
in which we are interested.


• Analyzer: The analyzer processes the stream of
provenance records, making sure that there are no
cyclic dependencies among objects. This is where
we implement our different versioning algorithms.


• Distributor: The distributor maintains provenance
for transient objects such as pipes and processes.
When these transient objects become part of the
ancestry of a regular file on a PASS volume, the
distributor creates a virtual object for them on the
PASS volume and stores their records to the volume.
Creating a virtual object avoids the need for
duplicating the provenance of transient objects each
time we need to create a causal dependency involving
them. Similar to regular files, PASS versions
these transient objects when necessary.


• Lasagna: Lasagna is the provenance-aware file system
that stores provenance records along with the
data. Internally, Lasagna writes the provenance into
a log.


• Waldo: Waldo is a user-level daemon that reads
provenance records from the log and stores them in
a database. After a data corruption, PASS’s recovery
tools consult this database to determine causal
relationships.


Figure 1: PASS Architecture
The observer and analyzer are central to causality-based
versioning and are discussed further in Section 4.

PASS maintains provenance, and therefore causal rela- 
tionships, for both persistent and transient objects. A re- 
lationship between two files (persistent objects) is there- 
fore expressed indirectly as two relationships, each be- 
tween a file and a process. For example, when a process 
issues a read system call, PASS creates a record stat- 
ing that the process causally depends upon the file being 
read. When that process then issues a write system 
call, PASS creates a record stating that the written file 
depends upon the process that wrote it. The causal rela- 
tionships described by these records are the heart of the 
data we use, both to instantiate file versions and also to 
choose files to restore after a corruption. 
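The translation from system calls to causal records can be sketched as below. The function name `observe` and the tuple encoding are assumptions made for illustration; they are not the PASS observer's actual interface.

```python
def observe(event):
    """Translate a (process, syscall, file) event into a causal record.

    The returned pair (dependent, ancestor) reads "dependent depends on
    ancestor", matching the arrow direction used in the text.
    """
    process, syscall, path = event
    if syscall == "read":
        return (process, path)   # the process depends on the file it read
    if syscall == "write":
        return (path, process)   # the file depends on the process that wrote it
    raise ValueError(f"unhandled system call: {syscall}")

# A copy from A to B by process P yields two records that together
# express the file-to-file relationship "B depends on A".
records = [observe(("P", "read", "A")), observe(("P", "write", "B"))]
```

Chaining the two records through P recovers the indirect relationship between the two files.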


3 Use Cases 


In this section, we discuss use cases that demonstrate the 
novel functionality enabled by causality-based version- 
ing. 


3.1 Intrusion Recovery 


Recently, one of the authors upgraded the software on his 
system. The upgraded packages included coreutils, 
the package that contains ls. This author completed
the upgrade and continued working for the rest of the
evening. However, when he came back the next morning,
ls emitted the message:


/bin/ls: unrecognized prefix: do
/bin/ls: unparsable value for LS_COLORS environment variable

The author then searched the web and learned that the behavior
might be the result of an intrusion. He promptly installed
chkrootkit and rkhunter, two popular programs
for checking whether a system has been hacked. However,
neither program located any known rootkits. At
this point, it was unclear if the aberrant behavior of ls
was due to the update to coreutils the evening before 
or an intrusion. 

Had he been running PASS, the author could have 
followed the chain of provenance dependencies of 
the file /etc/DIR_COLORS. (LS_COLORS is de- 
rived from DIR_COLORS). If the provenance chain of 
DIR_COLORS indicated that it was modified by the 
package manager or a legitimate system utility, then the 
system had not been hacked; otherwise, the system had 
been hacked. In the event the system had been hacked, 
there would be only one option: wipe the system and re- 
install. This is obviously undesirable, especially since 
the author had recently completed a re-install in order to 
upgrade. Faced with this situation, the author longed for
a versioning file system coupled with PASS that would 
permit him to selectively roll back the affected files to the 
version just before they were corrupted. Had that been an 
option, he could have continued using his system without 
a full re-install. 


3.2 Reproducing Research Results 


Systems that collect provenance frequently do so to facil- 
itate the reproduction of scientific results [22]. Consider 
the common scenario where a scientist collects data from 
some device (e.g., a telescope), transforms it through 
many intermediate stages and produces a final output file. 
Suppose he finds, a few months after publishing the data, 
that one of the programs in the intermediate stages had a 
bug. The scientist has no option but to begin anew with 
the raw data and then re-run all the experiments that he 
thinks may have been affected by the bug. 

If, however, he has complete provenance of his data, 
he can identify precisely which data sets were affected by 
the corrupt program. Hence, he need only re-run those 
experiments from the point at which the corruption oc- 
curred. Further, if the raw data is unavailable (frequently, 
raw data is archived and removed from data processing 
systems once it has undergone its initial pre-processing), 
he can use a versioning system to recover the missing 
raw data to re-run the experiments. This method obvi- 
ates the need to retrieve the raw data from archives or a 
central repository. 


3.3 System Configuration Management 


Software configuration management is extremely 
hard [26]. New software installation can (and regularly 
does) break existing software, because packages interact 
with each other through various agents: libraries,
registries, configuration files, and even environment 
variables. 

Provenance systems can help alleviate some of the 
problems of configuration management by helping users 
recover from a corrupt configuration. One of our au- 
thors recently installed a new music player on his sys- 
tem. The music player, in turn, depended on a number of 
libraries that needed to be updated or downloaded. Af- 
ter the install, the music player worked well, but much 
to the annoyance of the author, his movie player ceased 
to work. The author guessed that it was probably because
of the updates to the libraries. The author tried to
use the package manager to revert the system to the state 
that existed prior to the music player install. Using the 
system package manager to remove the music player did 
not help, because as far as the package manager was concerned,
the library updates were independent of the music
player install. The author could not manually undo
the library updates as he did not know the list of libraries 
that were installed. A record of the causal relationships 
between the libraries, the music player, and the movie 
player would have helped the author identify which of 
the libraries were common to the music player and the 
movie player and hence would have helped to point out 
(or narrow down) the offending library. Such causal data 
coupled with a versioning file system provides exactly 
the information needed to permit the user to revert all 
the modified files to a state prior to that of the music 
player install. Since the versioning system even restores 
the package manager database to its prior state, it pre- 
serves the consistency of the system. 

The problems outlined in this use case arise mainly 
because package management dependencies are gener- 
ated manually and are brittle in nature. Alternatively, one 
could use causal data recorded by PASS to gather the true 
dependencies of a package that, in turn, can help perform 
better roll backs after installation. 


3.4 Database Recovery 


Traditional databases are designed to recover from soft- 
ware and hardware crashes. However, those mechanisms 
are not sufficient to recover from a human error or a com- 
promise due to the time gap between the event occur- 
rence and the detection. In such cases, recovery involves 
a manual sanitizing of the database. Causality-based ver- 
sioning can help reduce the amount of effort and the 
downtime of the service as we show in the following ex- 
ample. For simplicity, we assume that the database is not 
running transactionally. 

A faulty client can corrupt a database by either adding 
incorrect entries, removing valid entries, or updating ex- 
isting entries incorrectly. Once the faulty client is de- 
tected, one can use the causal data collected by PASS 
with the versioning data to recover the database. Recov- 
ery is simple in the case where the last database update 
is by the faulty client. In this case, recovery simply en- 
tails reverting the database to a version before the client 
updated it. In the case where legitimate actions are in- 
terleaved with the actions of the faulty client, automatic 
recovery is hard as both legitimate and faulty updates are 
coalesced in main memory and then written to disk. In 
this scenario, one can use causal information to recover 
the database to a version before the faulty client’s modi- 
fications and a version after the faulty client’s modifica- 
tions and then compute a difference of the data dump 
between the two versions. The difference in the data 
dump will contain both legitimate and illegitimate data 
that needs to be sanitized, but the number of rows re- 
quiring manual checking is much smaller than the entire 
database. 
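The difference between the two data dumps can be sketched as a pair of set differences. This is only an illustration under the assumption that each dump can be read as a list of row tuples; the table contents below are hypothetical.

```python
def rows_needing_review(dump_before, dump_after):
    """Return rows that differ between two dumps of the same table.

    Each dump is an iterable of hashable rows (tuples here). Rows present
    only in `dump_after` were inserted or are the new half of an update;
    rows present only in `dump_before` were deleted or are the old half.
    """
    before, after = set(dump_before), set(dump_after)
    return sorted(after - before), sorted(before - after)

# Hypothetical dumps taken from the versions before and after the
# faulty client's modifications.
before = [(1, "alice", 100), (2, "bob", 250), (3, "carol", 75)]
after = [(1, "alice", 100), (2, "bob", 9999), (4, "dave", 10)]
new_rows, missing_rows = rows_needing_review(before, after)
```

Only the rows in the two returned lists need manual checking; unchanged rows such as the first one drop out of the comparison.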
3.5 Intellectual Property Compliance


Causality-based versioning can also help verify intellec- 
tual property (IP) compliance and enable removal of IP 
violations. For example, companies that use and develop 
both proprietary software and open source software rou- 
tinely require pre-release checks to make sure the pro- 
prietary software has not been tainted by open source 
software and vice-versa. In most cases, this is a te- 
dious, manual process. One can use causal relationships 
to identify paths between source files with different li- 
censing models. When coupled with a versioning file 
system, the system supports rigorous analysis of such li- 
cense pollution and potentially explicit means of revert- 
ing to untainted states. 


4 Versioning Algorithms 


As described in Section 2, in the PASS architecture the 
observer generates causal relationship data and the an- 
alyzer prunes these relationships, removing duplicates 
and cycles. Programs generally perform I/O in rela- 
tively small blocks (e.g., 4 KB), issuing multiple reads 
and writes when manipulating large files. Each read 
or write call causes the observer to emit a new record, 
most of which are identical. The analyzer removes these 
duplicates. Meanwhile, cycles can occur when multiple 
processes are concurrently reading and writing the same 
files [2]. Cycles in causality are nonsensical and must be 
avoided. In the PASS system, the analyzer prevents cy- 
cles by forcing new versions of objects to be created. It 
does this by choosing when to freeze them; that is, when 
to declare the current version “finished” and begin a new 
version. Transient objects (processes and pipes) can also 
be frozen to break cycles. To experiment with causality- 
based versioning, we installed the versioning algorithms 
in the analyzer. In this section, we describe the version- 
ing algorithms we used, referencing Table 1 as an exam- 
ple sequence of events. 


4.1 Traditional Algorithms 


In the open-close (OC) algorithm, the last close of a file 
(that is, when no more processes have the file open) trig- 
gers a freeze operation. The next open and write triggers 
the start of a new version. This algorithm does not pre- 
serve causality; some sequences of events (including the 
example in Table 1) produce cycles. Figure 2 illustrates 
how this happens. 
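The cycle can be reproduced with a small simulation. This sketch assumes the event sequence of Table 1, with the assignment of operations to processes P and Q taken from the cycle named in the Figure 2 caption, and assumes that every operation is wrapped in its own open and close so that each write starts a new file version. The function names are illustrative.

```python
def open_close(events):
    """Apply open-close versioning to (process, op, file) events.

    Each write starts a new file version because the file was closed
    after the previous operation. Processes are never versioned.
    """
    version = {}
    edges = set()
    for proc, op, name in events:
        version.setdefault(name, 1)
        if op == "write":
            version[name] += 1
            edges.add(((name, version[name]), (proc, 1)))
        else:
            edges.add(((proc, 1), (name, version[name])))
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in a set of (src, dst) edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    done, active = set(), set()

    def visit(node):
        if node in active:
            return True
        if node in done:
            return False
        active.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, []))
        active.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))

TABLE_1 = [("P", "read", "A"), ("Q", "read", "B"), ("P", "write", "B"),
           ("Q", "write", "A"), ("P", "read", "A"), ("Q", "read", "B")]
edges = open_close(TABLE_1)
cycle_after_all_steps = has_cycle(edges)
cycle_after_five_steps = has_cycle(open_close(TABLE_1[:5]))
```

Under these assumptions the graph stays acyclic through step 5 and the final read by Q closes the cycle.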

Step   P          Q
1      read A
2                 read B
3      write B
4                 write A
5      read A
6                 read B

Table 1: Example scenario to illustrate the versioning algorithms.
Each read and write operation is enclosed by an
open and close. All objects are initially at version one.

The version-on-every-write (ALL) algorithm creates a
new version on every write. This avoids any violations
of causality but potentially creates a large number of versions.
In this sense, it is the most conservative of the
algorithms we consider. The code for this algorithm is
quite simple; because each write results in a new version
of a file and each read results in a new version of a
process, each record refers to a distinct version of something.
Thus, there is no need to check for either duplicates
or cycles.
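A minimal sketch of version-on-every-write, applied to the Table 1 sequence, shows why no duplicate or cycle check is needed. The event encoding and function name are assumptions for illustration, and the assignment of operations to P and Q follows the figure captions.

```python
def version_on_every_write(events):
    """Return the causal records produced by the ALL algorithm.

    Every write starts a new version of the file and every read starts
    a new version of the reading process.
    """
    version = {}
    records = []
    for proc, op, name in events:
        version.setdefault(proc, 1)
        version.setdefault(name, 1)
        if op == "write":
            version[name] += 1
            records.append(((name, version[name]), (proc, version[proc])))
        else:
            version[proc] += 1
            records.append(((proc, version[proc]), (name, version[name])))
    return records

TABLE_1 = [("P", "read", "A"), ("Q", "read", "B"), ("P", "write", "B"),
           ("Q", "write", "A"), ("P", "read", "A"), ("Q", "read", "B")]
records = version_on_every_write(TABLE_1)
dependents = [dependent for dependent, _ in records]
```

Each record's dependent is a freshly created version, so it cannot already appear as anyone's ancestor; the cost is one new version per operation.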


4.2 Cycle-Avoidance 


The Cycle-Avoidance (CA) algorithm, as its name sug- 
gests, preserves causality by avoiding cycles. For each 
object, the analyzer maintains a unique object ID (as- 
signed at object creation), a version number (incre- 
mented on each freeze), and an ancestor table. The an- 
cestor table records the object ID and version number of 
all the immediate ancestors of the object. When CA receives
a record of the form $A_i \rightarrow B_j$, it stores $B_j$ in
the ancestor table of A. CA creates a new version of an
object whenever it adds a new ancestor, where different 
versions are considered distinct, to the object’s ancestor 
table. Doing so guarantees that no cycles will be created. 
CA differs from version-on-every-write, because not all 
writes add new ancestors. 

When the analyzer receives a record of the form $A_i \rightarrow B_j$,
it examines the ancestor table of A for $B_k$, that is,
some version k of object B, and uses the following rules
to perform both duplicate detection and cycle handling.


• Rule CA.1: If no $B_k$ exists in the ancestor table of
A, then B is a new ancestor for A. Issue a freeze
operation on A to create a new version and add $B_j$
to the ancestor table of A.


• Rule CA.2: If $B_k$ exists and $j = k$, then the causality
record $A_i \rightarrow B_j$ is a duplicate and we discard
the record.


• Rule CA.3: If $B_k$ exists and $j < k$, the new
record refers to a version older than the most recent
one recorded in A’s ancestor table, and the existing
causality relationship $A_i \rightarrow B_k$ subsumes
any causal relationships of $B_j$. Hence, the causality
record $A_i \rightarrow B_j$ is a duplicate and we discard the
record.


• Rule CA.4: If $B_k$ exists and $j > k$, $B_j$ is a newer
version than the $B_k$ in A’s ancestor table. Thus,
$B_j$ depends on some objects on which $B_k$ did not
depend. Therefore, we issue a freeze on A to create
a new version and update the ancestor table of A to
name $B_j$ instead of $B_k$.


Figure 2: Illustration of the open-close algorithm for the sequence in Table 1. The arrows represent causality and point opposite
to data flow. In (2.3), a new version of B is created as it is the first write since the last close. In (2.4), a new version of A is created
for the same reason. A cycle $A_2 \rightarrow Q \rightarrow B_2 \rightarrow P \rightarrow A_2$ (thick lines) results on the last read as shown in (2.6).

Figure 3: Illustration of the Cycle-Avoidance algorithm for the sequence in Table 1. In (3.1), a new version of P is created by
the rule CA.1. In (3.2), a new version of Q is created by the rule CA.1. In (3.3), a new version of B is created by the rule CA.1.
In (3.4), a new version of A is created by the rule CA.1. In (3.5), a new version of P is created by the rule CA.4. In (3.6), a new
version of Q is created by the rule CA.4. The end result is that there are no cycles.


Figure 3 illustrates the behaviour of the Cycle- 
Avoidance algorithm for the example sequence in Ta- 
ble 1. 
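The four rules can be sketched as a small class. This is an illustration under stated assumptions rather than the analyzer's actual code: objects are identified by hypothetical string IDs, a record names the current version of its ancestor unless a version is passed explicitly, and the Table 1 operations are assigned to P and Q as the Figure 3 caption implies.

```python
class CycleAvoidance:
    """Per-object version numbers and ancestor tables, as in rules CA.1 to CA.4."""

    def __init__(self):
        self.version = {}     # object ID -> current version number
        self.ancestors = {}   # object ID -> {ancestor ID: ancestor version}

    def add(self, a, b, j=None):
        """Process a record "a depends on version j of b"; return the rule applied."""
        self.version.setdefault(a, 1)
        self.version.setdefault(b, 1)
        if j is None:
            j = self.version[b]
        table = self.ancestors.setdefault(a, {})
        k = table.get(b)
        if k is None:
            rule = "CA.1"      # b is a new ancestor of a
        elif j == k:
            return "CA.2"      # exact duplicate, discard
        elif j < k:
            return "CA.3"      # older version, already subsumed, discard
        else:
            rule = "CA.4"      # newer version of a known ancestor
        self.version[a] += 1   # freeze a, starting a new version
        table[b] = j
        return rule

TABLE_1 = [("P", "read", "A"), ("Q", "read", "B"), ("P", "write", "B"),
           ("Q", "write", "A"), ("P", "read", "A"), ("Q", "read", "B")]

ca = CycleAvoidance()
rules = []
for proc, op, name in TABLE_1:
    # A read makes the process depend on the file; a write is the reverse.
    rules.append(ca.add(proc, name) if op == "read" else ca.add(name, proc))
```

Under these assumptions the rule sequence matches the Figure 3 caption: four applications of CA.1 followed by two of CA.4, leaving P and Q at version three and A and B at version two.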


4.2.1 Self-Cycles 


Self-cycles arise when a process is both reading and writ- 
ing the same file. Some programs, such as the GNU 
linker, generate self-cycles as they repeatedly read from 
and write to their output files. The Cycle-Avoidance al- 
gorithm as described can create a large number of unnec- 
essary versions in this situation. To avoid this, we track 
each object’s last ancestor. When the analyzer receives
a record of the form $A_i \rightarrow B_j$, it makes the following
check:


• Rule CA.self-cycle: If the last ancestor of $B_j$ is
$A_i$, the new record creates a self-cycle; discard the
record.


The above rule tells us that the last version change of 
B occurred because of the current version of A. In that 
case, the data being fed to A originated from A itself, and
we have a self-cycle. Records representing self-cycles do 
not add information and can be dropped immediately. 
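The self-cycle check can be sketched as a single lookup. The dictionary `last_ancestor` and the object names are hypothetical; the sketch assumes the analyzer remembers, for each object version, the ancestor whose record caused that version to be created.

```python
def is_self_cycle(last_ancestor, a, b):
    """Return True if the record a -> b should be discarded.

    `a` and `b` are (object ID, version) pairs. `last_ancestor` maps an
    object version to the ancestor that triggered its creation.
    """
    return last_ancestor.get(b) == a

# ld at version 1 writes a.out, which starts a.out version 2.
last_ancestor = {("a.out", 2): ("ld", 1)}

# ld then reads its own output back: the record ld_1 -> a.out_2.
discard = is_self_cycle(last_ancestor, ("ld", 1), ("a.out", 2))
# A different process reading a.out is an ordinary dependency.
keep = not is_self_cycle(last_ancestor, ("strip", 1), ("a.out", 2))
```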


4.3 Graph-Finesse 


As described above, Cycle-Avoidance decides when to 
create new versions using local knowledge about the ob- 
ject to which a dependency refers. In contrast, Graph- 
Finesse (GF) uses global knowledge to make its deci- 
sions. It maintains a global directed graph of the causal 
dependency relationships between objects. The GF al- 
gorithm checks each new record against the graph and 
forces the creation of a new version of a single file if and 
only if adding the record would otherwise create a cycle. 
The name arises from the fact that it picks out a compar- 
atively small number of new versions to create while still 
preserving causality. 

Given a record $A_i \rightarrow B_j$, GF uses the following rules:


• Rule GF.dup: Check if $A_i \rightarrow B_j$ already exists in
the causal-dependency graph. If so, the record is a
duplicate; discard it.


• Rule GF.detect: Check if $B_j \rightarrow^* A_i$, that is, if a
path of zero or more steps exists linking $B_j$ to $A_i$.
If so, then $A_i \rightarrow B_j \rightarrow^* A_i$ forms a cycle. Freeze
A, creating $A_{i+1}$, change the record to $A_{i+1} \rightarrow B_j$,
and add this information to the graph. There will
now be no cycle; because $A_{i+1}$ is new, it cannot be
an ancestor of $B_j$.


• Rule GF.default: Otherwise, add $A_i \rightarrow B_j$ to the
graph.


Figure 4: Illustration of the Graph-Finesse algorithm for the sequence in Table 1. In (4.3) and (4.4), new versions of B and A are
created by the open-close mechanism. In (4.6), a path exists from $B_2$ to $Q_1$ (thick lines), so $Q_2$ is created by rule GF.detect and no
cycle is formed. As we can see, Graph-Finesse creates fewer versions than Cycle-Avoidance.


By design, GF also subsumes open-close versioning 
and includes the freeze-on-last-close behavior as de- 
scribed in Section 4.1. 
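The three rules, together with the open-close behaviour, can be sketched as follows. This is an illustrative model rather than the implementation evaluated in the paper: nodes are hypothetical (object ID, version) pairs, the driver assumes each Table 1 operation has its own open and close, and the P and Q assignment follows the Figure 4 caption.

```python
class GraphFinesse:
    """Global dependency graph with rules GF.dup, GF.detect and GF.default."""

    def __init__(self):
        self.graph = {}   # node -> set of ancestor nodes

    def reaches(self, src, dst):
        """True if a path of zero or more edges leads from src to dst."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.graph.get(node, ()))
        return False

    def add(self, a, b):
        """Process the record a -> b; return (rule, node that now depends on b)."""
        if b in self.graph.get(a, ()):
            return "GF.dup", a
        rule = "GF.default"
        if self.reaches(b, a):
            a = (a[0], a[1] + 1)   # freeze a; the new version has no dependents yet
            rule = "GF.detect"
        self.graph.setdefault(a, set()).add(b)
        return rule, a

def run(events):
    gf, version, rules = GraphFinesse(), {}, []
    for proc, op, name in events:
        version.setdefault(proc, 1)
        version.setdefault(name, 1)
        if op == "write":
            version[name] += 1   # open-close: first write after a close
            a, b = (name, version[name]), (proc, version[proc])
        else:
            a, b = (proc, version[proc]), (name, version[name])
        rule, node = gf.add(a, b)
        version[node[0]] = node[1]
        rules.append(rule)
    return rules, version

TABLE_1 = [("P", "read", "A"), ("Q", "read", "B"), ("P", "write", "B"),
           ("Q", "write", "A"), ("P", "read", "A"), ("Q", "read", "B")]
rules, version = run(TABLE_1)
```

In this model only the last read triggers GF.detect, so Q advances to version two while P stays at version one, which is fewer process versions than Cycle-Avoidance creates for the same sequence. The price is the graph search in `reaches` on every record.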

To keep the graph from growing without bound, it is 
important to prune it. Any node in the graph (represent- 
ing some version of some object) may be pruned if that 
node is frozen, that is, any future write to the object will 
create a new version or be to some already existing newer 
version, and all the ancestors of the node are frozen. (If 
an ancestor is unfrozen, writing to it may cause a cycle.) 
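The pruning condition can be sketched as a predicate over the graph. This sketch reads "all the ancestors" as all transitive ancestors, which is an assumption, and the graph and frozen set below are hypothetical.

```python
def prunable(node, graph, frozen):
    """Return True if `node` and every ancestor reachable from it are frozen.

    `graph` maps a node to the set of its immediate ancestors and
    `frozen` is the set of nodes that will receive no further writes.
    """
    if node not in frozen:
        return False
    stack, seen = list(graph.get(node, ())), set()
    while stack:
        ancestor = stack.pop()
        if ancestor in seen:
            continue
        if ancestor not in frozen:
            return False
        seen.add(ancestor)
        stack.extend(graph.get(ancestor, ()))
    return True

# Hypothetical graph: A2 depends on Q1, which depends on B1.
graph = {("A", 2): {("Q", 1)}, ("Q", 1): {("B", 1)}}
frozen = {("A", 2), ("B", 1)}
blocked = prunable(("A", 2), graph, frozen)           # Q1 is still unfrozen
allowed = prunable(("A", 2), graph, frozen | {("Q", 1)})
```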

It is therefore possible to bound the size (or diame- 
ter or other measure) of the graph at the cost of creating 
extra versions, by freezing unfrozen objects that would 
otherwise be pruned. We have considered this but not 
implemented it, because there seems to be little need for 
it in practice. 

Graph-Finesse can consume more CPU for some 
workloads as it has to traverse the graph on every record 
addition. Our evaluation confirms this. The memory 
consumption of Graph-Finesse, however, is comparable 
to the other algorithms. Graph-Finesse can also use the 
self-cycle logic described in Section 4.2.1. Figure 4 illus- 
trates the behaviour of the Graph-Finesse algorithm for 
the example workload. 


4.4 Discussion 


We now discuss how pruning entries in CA and GF does 
not affect the use cases discussed in Section 3. The OC,
CA, and GF algorithms record the same versioning data 
for the first three use cases as they involve an application 
making a one time modification to all the data files in- 
volved. Such changes generate the same causal informa- 
tion and data versions for the three algorithms. The ALL 
algorithm generates the same causal information and ver- 
sioning data as the other algorithms with the difference 
being that the data is spread over more versions. For the 
remaining two use cases, the causal algorithms record 
different causal information and versioning data. We explain
how they differ in the database recovery (Section 3.4) use
case.

The database server (server) opens the database at 
startup, writes to it in response to client requests, and 
closes the file at shutdown. In OC, the first time the 
server writes to the database, it creates a new version. All 
subsequent database modifications are part of this new 
version and old data is not copied before applying these 
modifications, because the file is still open. Thus, restor- 
ing the old version loses any legitimate modifications 
between the first write and the faulty writes. Clearly, 
OC versioning is not sufficient for the database recovery 
use case. ALL has sufficient information as it recorded 
all causal information and versioned on every database 
modification. However, versioning the database this fre- 
quently is potentially very inefficient. 

In CA, reception of a client request produces a new 
version of the server process. This, in turn, triggers a 
new version of the database, when the server modifies 
the database. Multiple modifications resulting from a 
single client request do not create multiple versions, be- 
cause the server’s version does not change. However, 
when a new (faulty) request arrives, the server’s version 
increments as does the database’s version. Interleaved 
requests from multiple clients will generate many ver- 
sions. Thus, in a single client case, CA behaves like OC 
and when clients interleave, it behaves like ALL. Hence, 
there is sufficient information to undo a faulty client’s 
updates. The GF algorithm behaves in a manner similar 
to CA. 


5 Implementation 


In this section, we describe our versioning design 
choices, our implementation, and the limitations of 
our system. Our versioning design was influenced by 
the comprehensive versioning file system (CVFS) [24], 
which explored metadata efficiency in versioning file 
systems. CVFS showed that logging all the modifica- 
tions of a file to a journal is more efficient than creat- 
ing a new inode for each version. Hence, we use a redo 
log for storing versioning data for files. Although CVFS 
uses multiversion B-trees to handle versions of directories, 
we store directory metadata in an undo log, as this 
is much simpler to implement. We implement our ver- 
sioning system by modifying Lasagna, the PASS storage 
engine. Lasagna is a stackable file system, which used 
eCryptfs [8] as its starting code base. 


5.1 File Versioning 


For each file foo, we maintain a version file v;12345, 
where 12345 is the inode number of foo. The version 
file is a log where we record old data before updating 
the primary file. Inode numbers are never reused, be- 
cause we never delete files. For locality, we keep the ver- 
sion files and the files they describe in the same directory. 
Users cannot access the version file directly as we filter 
out the version file names in readdir and lookup. 

The version file consists of the following three types 
of log records. The version record marks the start of data 
records for a version. It contains the version number and 
the metadata attributes of the version such as the file size, 
uid, gid, etc. The page record holds old data being over- 
written. This contains the data and the page number in 
the file from which the data came. Finally, the beginptr 
record is the last record of a version; it records the loca- 
tion of the corresponding version record to allow scan- 
ning backwards. 

Each version begins with a version record, ends with 
a beginptr, and has some number of intervening page 
records. We write a beginptr record to the version file 
when a freeze request is issued on the file. We write a 
version record on the first write call on the file after a 
freeze. 
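The record layout described above can be sketched as a few Python dataclasses plus a check of the ordering rule (version record, then page records, then beginptr). The field names, the use of a list index in place of a file offset, and the function `is_well_formed` are illustrative assumptions, not the on-disk format used by the system.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class VersionRecord:
    version: int
    size: int      # file size attribute of this version
    uid: int
    gid: int


@dataclass
class PageRecord:
    page_no: int   # page number in the primary file
    data: bytes    # old contents that are about to be overwritten


@dataclass
class BeginPtrRecord:
    version_record_index: int  # stands in for the location of the version record


LogRecord = Union[VersionRecord, PageRecord, BeginPtrRecord]


def is_well_formed(log: List[LogRecord]) -> bool:
    """Check version, page*, beginptr grouping; the last group may still be open."""
    start = None
    for i, rec in enumerate(log):
        if isinstance(rec, VersionRecord):
            if start is not None:
                return False
            start = i
        elif isinstance(rec, PageRecord):
            if start is None:
                return False
        elif isinstance(rec, BeginPtrRecord):
            if start is None or rec.version_record_index != start:
                return False
            start = None
        else:
            return False
    return True
```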

When a program issues a write call on a file, we 
read the pages that the write call overwrites and write 
a page record for each. When a file is truncated, we 
log all discarded pages. We record each page only once 
per version. For example, if the file system receives 
two 4K writes at offset 0, we log data only for the 
first (assuming the version does not change between the 
writes). On an unlink, we rename the target file to 
v;12345;deleted, where 12345 is the inode number. 
A native file system could remove the file blocks 
from the primary file and append them to the version file. 
Lasagna, however, is a stackable file system and does not 
control the file layout of the underlying file system. In- 
stead, we rename the file on the last unlink. This is 
more efficient than copying blocks from the primary file 
to the version file, especially for large files. 
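The "record each page only once per version" rule can be sketched with a small in-memory model. The class `VersionedFile`, its `freeze` method, and the tuple-based log are hypothetical simplifications; in particular, this toy bumps the version number at freeze time and skips logging for pages that held no data before the write.

```python
PAGE_SIZE = 4096  # assumed page size


class VersionedFile:
    def __init__(self, data: bytes = b""):
        self.data = bytearray(data)
        self.version = 1
        self.log = []              # entries are (version, page_no, old_bytes)
        self.logged_pages = set()  # pages already saved in the current version

    def freeze(self):
        # The next write starts a new version, so pages must be saved again.
        self.version += 1
        self.logged_pages.clear()

    def write(self, offset: int, buf: bytes):
        if not buf:
            return
        first = offset // PAGE_SIZE
        last = (offset + len(buf) - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            if page in self.logged_pages:
                continue
            start = page * PAGE_SIZE
            old = bytes(self.data[start:start + PAGE_SIZE])
            if old:  # nothing to preserve for pages past the old end of file
                self.log.append((self.version, page, old))
            self.logged_pages.add(page)
        end = offset + len(buf)
        if end > len(self.data):
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = buf
```

With this model, two 4K writes at offset 0 within one version produce a single log entry, matching the example in the text; a third write after a freeze produces a second entry.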


5.2 Directory Versioning 


For each directory, we maintain (within the directory) a 
version file named by inode number, as we do for files. 
The directory version file has version and beginptr log 
records as we do for a regular file. It also has three 
other log record types: The add entry record represents 
an addition to a directory via create, link, mkdir, 
or symlink. This record contains the inode and version 
of the directory to which we are adding, and the name, 
inode, and version of the entry, which can be a file or 
a directory. The del entry record represents a removal 
from a directory by unlink or rmdir. This contains 
the same data as the add entry record. The rename entry 
records a directory entry being moved from one direc- 
tory to another. Where appropriate, this is written to the 
version logs of both directories involved in the rename. 
This record contains the inode and version of the new di- 
rectory, the old directory, the old file, and the overwritten 
target file (if it existed), and the old and new file names. 
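The three directory record types can be sketched as dataclasses. The field names are illustrative, and the helper `logs_to_update` encodes one possible reading of "where appropriate": a rename within a single directory is written to one log, while a rename across directories is written to both. That reading is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Ref = Tuple[int, int]  # (inode number, version)


@dataclass(frozen=True)
class AddEntry:
    directory: Ref  # directory being added to
    name: str
    entry: Ref      # the file or directory that was added


@dataclass(frozen=True)
class DelEntry:
    directory: Ref  # same fields as AddEntry
    name: str
    entry: Ref


@dataclass(frozen=True)
class RenameEntry:
    new_dir: Ref
    old_dir: Ref
    old_file: Ref
    overwritten: Optional[Ref]  # target that was replaced, if one existed
    old_name: str
    new_name: str


def logs_to_update(rec: RenameEntry) -> List[int]:
    """Inode numbers of the directories whose version logs receive the record."""
    return sorted({rec.old_dir[0], rec.new_dir[0]})
```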

Because version logs are never deleted and files are renamed 
on deletion, directories are never truly empty, so 
our versioning file system cannot perform rmdir operations. 
Instead, when a directory is removed, we check 
to see if all the files in the directory are either v;inode or 
v;inode;deleted files, i.e., all files are either version 
files or deleted files. If so, the directory is “virtually” 
empty, and we can move the directory out of the 
way using a ;deleted suffix. 
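The "virtually empty" test can be sketched as a name filter. The pattern below assumes names of the form v;<inode> and v;<inode>;deleted; the exact naming scheme, and how already-removed subdirectories are named, are assumptions, so this sketch treats any other name as a live entry.

```python
import re

# Assumed naming scheme: "v;<inode>" for version files,
# "v;<inode>;deleted" for files that were unlinked.
_HIDDEN_NAME = re.compile(r"^v;\d+(;deleted)?$")


def is_virtually_empty(names):
    """True if every directory entry is a version file or a deleted file."""
    return all(_HIDDEN_NAME.match(name) is not None for name in names)
```

For example, a directory holding only `v;12345` and `v;678;deleted` would count as virtually empty, while one that also holds `notes.txt` would not.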


5.3 Accessing Previous Versions 


We provide an ioctl that is used to access old versions 
of a file. The ioctl takes as input a name, a version, 
and a file descriptor and recovers the old version into the 
file descriptor. Internally, we perform recovery by scan- 
ning backward in the version file until we find the de- 
sired version record. Once this has been found, we scan 
forward in the version file writing the data pages of the 
file version to the user supplied file descriptor. We also 
update the attributes of the file descriptor based on the 
values recorded in the version record. Hence, previous 
version access is a redo operation. Directory operations, 
on the other hand, are undone depending on the contents 
of the version log records. 
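The backward-then-forward scan can be sketched over a simplified in-memory log. Several assumptions are made here and are not confirmed by the text: records are tuples, the version record stores a page count rather than a byte size, page records that follow the version record for version v hold the contents those pages had in v, and pages never logged afterwards are unchanged in the current file. The function name `recover_version` is hypothetical.

```python
def recover_version(log, current_pages, wanted):
    """Rebuild the pages of version `wanted`.

    log entries: ("version", number, n_pages), ("page", page_no, data),
                 ("beginptr", index_of_version_record)
    """
    # Scan backward until the desired version record is found. A real log
    # could jump from beginptr to version record instead of stepping.
    start = None
    for i in range(len(log) - 1, -1, -1):
        if log[i][0] == "version" and log[i][1] == wanted:
            start = i
            break
    if start is None:
        raise KeyError(wanted)

    # Scan forward: the first saved copy of each page is its content in `wanted`.
    restored = {}
    for rec in log[start + 1:]:
        if rec[0] == "page":
            _, page_no, data = rec
            restored.setdefault(page_no, data)

    n_pages = log[start][2]
    result = []
    for p in range(n_pages):
        if p in restored:
            result.append(restored[p])
        elif p < len(current_pages):
            result.append(current_pages[p])  # page never overwritten since then
        else:
            result.append(b"")
    return result
```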


5.4 Limitations 


The causal data that we capture is an approximation and 
can lead to false positives or negatives while performing 
analysis for recovery. For example, /etc/shadow 
will be in the history of every process and file, as login 
reads it while authenticating users. Hence even legitimate 
users that log in after an attack can appear to be 
causally related to the attack. The general approach to 
deal with this has been to white-list some of these files, 
i.e., ignore the causal information on some files while 
performing analysis. Further, contextual policies to ignore 
some of the causal information can help improve the 
results of the analysis. For example, to construct the list 
of files needed to migrate an application, we need to use 
all the causal information as we do not want the applica- 
tion to fail on restart. However, while analyzing causal 
data during intrusion recovery, we need consider only 
those files that have been written by illegal processes. 

As with all provenance systems, PASS cannot capture 
causal dependencies external to the computer. For ex- 
ample, when a user prints a file and then makes some 
notes based on what she read, PASS cannot capture the 
dependency between the notes and the source file. PASS 
does, however, allow users and applications to annotate 
user-knowledge or application specific provenance to the 
provenance collected by PASS (with the obvious limita- 
tion that the user or application needs to take the correct 
action). 

Attackers can perform a denial of service attack by re- 
peatedly overwriting files, filling the disk with versioning 
information. While this is not different from an attacker 
filling the disk with regular files, we can do better than 
regular file systems by using the causal information to 
detect anomalous behaviour and prevent it [12, 23]. 

Because our implementation is a stackable file sys- 
tem, it is vulnerable to tampering by means of un- 
mounting it and inspecting or altering the underlying 
state. This could be prevented by using cryptography 
or by having a non-stackable implementation and using 
a securelevel-type scheme to protect raw disk de- 
vices. 

Finally, we are vulnerable to an intruder changing 
the kernel. Once the intruder has access to the kernel, 
she can change the causal information and the version- 
ing information, thus making accurate recovery impossi- 
ble. Secure Disk Systems [27], where an intruder can- 
not modify the causal or versioning data once it has been 
written to disk, helps solve the problem partially, by al- 
lowing users to recover data up to the point the system 
was subverted. This is better than a clean system install. 
Causal versioning is still useful for recovery in the cases 
where attackers do not care to cover up their tracks, such 
as when they set up a bot on a machine and abandon the 
system after a few days. 


6 Evaluation 


The goal of our evaluation is twofold. First, to quan- 
tify the overheads introduced by the different version- 
ing systems. Second, to evaluate the efficacy and per- 
formance of the different algorithms during recovery of 
files. We address these goals as follows: First, we dis- 
cuss the evaluation platform and the configurations we 
used for evaluation. In Section 6.1, we discuss the performance 
overheads for four benchmarks representative 
of a broad range of workloads. In Section 6.2, we discuss 
how the versioning algorithms perform during recovery. 
We ran all the benchmarks on a 3GHz Pentium 4 machine 
with 512MB of RAM. The machine had an 80GB 
7200 RPM Western Digital Caviar WD800JB hard drive 
that was used to store all file system data and metadata, 
including causality. The machine was running Fedora 
Core 5 with a PASS kernel based on Linux 2.6.23.17 and 
Lasagna was stacked on Ext2. We recorded elapsed, sys- 
tem, and user times, and the amount of disk space uti- 
lized for all tests. We also recorded the wait times for 
all tests; wait time is mostly I/O time, but other factors 
such as scheduling time can also affect it. We compute 
wait time as the difference between the elapsed time and 
system+user times. We do not discuss the user time as 
it is not affected by the modifications in the kernel. We 
also ran the same benchmarks on Ext2 using those re- 
sults as a baseline. In order to separate the overhead due 
to versioning from the overhead due to causality, we also 
ran all experiments using versioning without enabling 
causality collection. We used the open-close algorithm 
for the latter experiments. We ran each experiment at 
least 5 times. In all cases, the standard deviations were 
less than 5%. 
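The two derived quantities used throughout this section, wait time and percentage overhead relative to the Ext2 baseline, can be written down directly. The function names are illustrative, and any inputs passed to them in an example are hypothetical rather than measurements from this evaluation.

```python
def wait_time(elapsed: float, system: float, user: float) -> float:
    """Wait time is the elapsed time minus the sum of system and user times."""
    return elapsed - (system + user)


def overhead_percent(measured: float, baseline: float) -> float:
    """Overhead of a configuration relative to the baseline, in percent."""
    return 100.0 * (measured - baseline) / baseline
```

For instance, a hypothetical run with 100 s elapsed, 20 s system, and 50 s user time has a wait time of 30 s.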
We evaluate the system under the following configu- 
rations: 
VER: open-close versioning with no causal data 
OC: open-close versioning with causal data 
CA: Cycle-Avoidance versioning with causal data 
GF: Graph-Finesse versioning with causal data 
ALL: Version-on-every write with causal data 


6.1 Performance Overhead Results 


We ran the following four workloads to evaluate the 
versioning algorithms. 
1. Linux compile, in which we unpack and build Linux 
kernel version 2.6.19.1. This benchmark represents a 
CPU-intensive workload. 
2. Postmark, which simulates the operation of an email 
server. This represents an I/O-intensive workload. We 
ran 1,500 transactions with file sizes ranging from 4 KB 
to 1 MB, with 10 subdirectories and 1,500 files. 
3. Mercurial activity benchmark, where we start with a 
vanilla Linux 2.6.19.1 kernel and apply, as patches, each 
of the changes that we committed to our own 
Mercurial-managed source tree. This benchmark evaluates 
the overhead a user experiences in a normal development 
scenario, where the user works on a small subset of the 
files over a period of time. 
4. A biological blast [1] workload that is representative 
of a scientific workload. The workload finds protein 
sequences that are closely related in two different 
species. The workload formats two input data files with a 
tool called formatdb, processes the two files with blast, 
and then massages the output data with a series of perl 
scripts. 

Figure 5: Linux compile elapsed time results. 

















      | Causal Data     | Version Space 
VER   | -               | 37.6MB (2.9%) 
OC    | 154.5MB (12.0%) | 49.0MB (3.8%) 
CA    | 172.3MB (13.4%) | 54.8MB (4.3%) 
GF    | 154.5MB (12.0%) | 49.0MB (3.8%) 
ALL   | 462.6MB (35.9%) | 1.1GB (85.7%) 

Table 2: Linux compile space overheads. All the overheads 
shown are computed as a percentage of the data in vanilla Ext2 
(1.26GB). 


Linux Compile Benchmark Results Figure 5 shows 
the elapsed time results for Linux compile and Table 2 
shows the space overhead. Plain versioning (VER) adds 
11.9% to the elapsed time and 2.9% to the space. The 
increase in elapsed time is mostly due to the additional 
writes performed to store versions, but a small portion is 
due to the fact that we use a stackable file system. For 
the OC, CA, and GF algorithms, the overheads increase 
moderately over VER to 17.1%, 18.3% and 21.3% re- 
spectively. This increase is due to the extra writes issued 
to record causal data. For this benchmark, CA and GF, 
the causality based algorithms, perform comparably to 
OC in terms of elapsed time and version space. ALL, 
as we expect, has the worst elapsed time performance 
with 57.4% overhead. The ALL overhead is a result of 
the enormous number of versions being created and the 
quantity of data necessary to do so. The system time 
also increases significantly for ALL due to the distribu- 
tor having to cache large amounts of causal data. 


Postmark Benchmark Results Figure 6 shows the 
elapsed time results for Postmark and Table 3 shows the 
space overheads. The overheads follow a pattern similar 
to the overheads for the Linux compile benchmark. VER 
has the lowest overhead at 8.2%. VER’s overhead is due 
to the extra writes to record version data and the double 
buffering in Lasagna (stackable file systems cache both 
their data pages and lower file system data pages). The 
overheads increase marginally for OC, CA, and GF to 
9%, 9%, and 9.1% respectively. The increase is marginal 
as the causal information recorded is minimal and the 
version data also increases minimally from VER to OC, CA, 
and GF. Once again, ALL, with a 26.6% overhead, 
exhibits the greatest overhead, as expected. 

Figure 6: Postmark elapsed time results. 

      | Causal Data    | Version Space 
VER   | -              | 1.28GB 
OC    | 1.8MB (0.14%)  | 1.28GB 
CA    | 1.2MB (0.09%)  | 1.28GB 
GF    | 1.9MB (0.15%)  | 1.28GB 
ALL   | 61.2MB (4.74%) | 1.38GB 

Table 3: Postmark space overheads. Causal data overheads are 
computed as a percentage of the data written in Ext2 (1.26GB). 
Postmark deletes all files it creates at the end of the benchmark. 
In versioning systems, however, no file is deleted and all 
unlinked files are retained as is. The version space column shows 
the amount of space retained at the end of each algorithm. 


Mercurial Activity Results Figure 7 shows the 
elapsed time results for the Mercurial activity benchmark 
and Table 4 shows the space overhead. The performance 
overheads follow the pattern we have seen so far. However, 
surprisingly, GF performs worse than even ALL 
for this benchmark. VER, OC, CA, GF, and ALL have 
overheads of 25.9%, 28.8%, 27.9%, 89.6%, and 61.3% 
respectively. The performance overhead of GF is a result 
of a very large patch combined with the way the program 
patch functions. patch works by first reading 
the patch file and the file to patch, then merging the two 
files into a temporary file, and finally renaming the 
temporary file to the file specified in the patch. At one point 
during development, we moved from a Linux 2.6.19.1 
kernel to a Linux 2.6.23.17 kernel. This resulted in a 
large patch touching all the source files in the repository. 
This forced a single instance of patch to read and write 
all of the 20,000 files in the Linux source tree. Every 
time patch writes to a new file, GF verifies that the file 
does not form a causality-violating cycle with the files 
that patch previously read. This results in the heavy 
system time overheads. This problem could be alleviated 
by having patch spawn multiple processes each of 
which merges a unique subset of the files specified in the 
patch file. 

Figure 7: Mercurial activity elapsed time results. 

      | Causal Data   | Version Space 
VER   | -             | 228.1MB (26.6%) 
OC    | 38.3MB (4.5%) | 233.4MB (27.2%) 
CA    | 28.3MB (3.3%) | 230.6MB (26.9%) 
GF    | 30.3MB (4.7%) | 233.4MB (27.2%) 
ALL   | 77.8MB (9.1%) | 383.3MB (44.6%) 

Table 4: Mercurial activity space overheads. All the 
overheads shown are computed as a percentage of the data in Ext2 
(859MB). 

Another anomaly is that CA generates less causal data 
than OC. The explanation is that this workload generates 
a rename for every file to be patched. OC issues a freeze 
on the directory every time a directory is modified. CA, 
however, uses the causal history to determine that the 
same process is modifying the directory and eliminates 
duplicate entries. This results in CA consuming the least 
amount of both causal and version data. 

Figure 8: Blast elapsed time results. 

      | Causal Data   | Version Space 
VER   | -             | 40KB (0.7%) 
OC    | 172KB (2.9%)  | 40KB (0.7%) 
CA    | 176KB (3.1%)  | 40KB (0.7%) 
GF    | 172KB (2.9%)  | 36KB (0.6%) 
ALL   | 3.7MB (65.4%) | 14.4MB (257.4%) 

Table 5: Blast workload space overheads. All the 
overheads shown are computed as a percentage of the data in Ext2 
(5.8MB). 

Blast Workload Results Figure 8 shows the elapsed 
time results for the blast workload and Table 5 shows the 
space overhead. The overheads for this workload follow 
the pattern seen in the previous workloads. The time 
overhead is 1.4% for the VER, OC, CA, and GF 
configurations. The causal data overhead is at most 3.1% and 
the version data overhead is less than 1% for VER, OC, 
CA, and GF configurations. The workload is CPU intensive 
and processes a small number of large files, resulting 
in the minimal overheads for VER, OC, CA, and GF. For 
ALL, the elapsed time overhead is 8.8% and the space 
overhead is 65.4% on causal data and 257.4% on version 
data. The version data overhead of ALL is due to the 
behaviour of formatdb and blast. They write data 
in chunks smaller than a page, which versions the same 
page multiple times. 
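The effect of sub-page writes on version-on-every-write can be illustrated by counting page copies for a sequential overwrite. The 4KB page size and the 512-byte chunk size used in the usage note are assumptions chosen for illustration; the text does not state the chunk sizes that formatdb and blast use. The function name `pages_logged` is hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size


def pages_logged(file_size: int, write_size: int, version_on_every_write: bool) -> int:
    """Count page copies saved while overwriting a file in write_size chunks."""
    logged = 0
    seen = set()  # pages already saved in the current version
    for offset in range(0, file_size, write_size):
        end = min(offset + write_size, file_size)
        first_page = offset // PAGE_SIZE
        last_page = (end - 1) // PAGE_SIZE
        for page in range(first_page, last_page + 1):
            # Versioning on every write saves the page again for each write;
            # otherwise a page is saved once per version.
            if version_on_every_write or page not in seen:
                logged += 1
                seen.add(page)
    return logged
```

Under these assumptions, overwriting an 8KB file in 512-byte chunks saves 16 page copies when every write creates a version, but only 2 when each page is saved once per version.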


6.2 Recovery Benchmark Results 


The goal of this subsection is to answer the following 
questions: First, in scenarios where open-close is suffi- 
cient (such as the first three use cases in Section 3), do 
the (unnecessary) causality-based algorithms impose ad- 
ditional recovery overhead? Second, in scenarios where 
causality does matter (the last two use cases in Section 3), 
how do the algorithms compare in recovery time and data 
loss? 

To answer the first question, we wrote a program 
that simulates the behaviour of a worm. Worms typically 
overwrite one or more files/executables (for example, 
common executables like ls) and install some new 
files/programs (irchat servers being a popular choice). 
Our worm-simulator functions in a similar manner. The 
program traverses a copy of the Linux-2.6.19.1 source 
tree, overwrites some files and creates new “bad” files. 
All in all, we taint 25,600 files, writing a total of 500MB 
of data. Table 6 shows the time taken for recovering from 
this attack by each of the versioning algorithms. Recovery 
is performed in two phases. In the first phase, once 
a malicious process has been identified, we traverse up 
that process’s causal data graph to determine the root 
cause of the break-in. Backtracker [11] and Taser [7] 
perform a similar analysis to determine the cause. In the 
second phase, once we know the root cause of the attack, 
we propagate down the root process descendant tree to 
identify potential victims and recover them to a version 
just before the malicious process tampered with them. The 
recovery times that we report here are the times of the 
second phase. The results show that the recovery times 
are proportional to the amount of causal and versioning 
data stored. CA has the best recovery time and ALL has 
the worst recovery time. This is despite the fact that ALL 
recovers the same amount of data as other algorithms and 
reads roughly the same amount of data as OC and GF 
to perform the recovery. ALL stores more versioning 
information than the other algorithms. Hence the required 
recovery data is spread over a much larger area on the 
disk. In turn, the recovery process has to perform more 
seeks to recover the same amount of data. 

      | Causal Data | Version Space | Recovery Time | Bytes Read 
OC    | 21.2MB      | 151.6MB       | 173.4s        | 179.1MB 
CA    | 12.7MB      | 150.4MB       | 163.2s        | 163.9MB 
GF    | 21.1MB      | 151.9MB       | 173.2s        | 179.1MB 
ALL   | 52.8MB      | 249.3MB       | 191.2s        | 182.9MB 

Table 6: Results for recovering from a worm attack. All 
algorithms recover the same amount of data (161.68MB), but read 
different amounts of data to perform the recovery. 

      | Causal | Version | Bytes Read                    | Recovery Time 
      | Data   | Data    | 11th    | 7th     | 3rd     | 11th   | 7th    | 3rd 
OC    | 60KB   | 12KB    | -       | -       | -       | -      | -      | - 
CA    | 176KB  | 470.5MB | 39.15MB | 39.16MB | 39.17MB | 23s    | 25.2s  | 27.4s 
GF    | 184KB  | 470.5MB | 39.15MB | 39.16MB | 39.15MB | 23s    | 24.6s  | 25.8s 
ALL   | 76.9MB | 1.97GB  | 45.24MB | 53.99MB | 62.76MB | 214.2s | 452.8s | 689s 

Table 7: Results for recovering from the Apache simulator. All 
algorithms recover the same amount of data (40MB), but read in 
different amounts of data to perform the recovery. 
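The two recovery phases described for the worm experiment can be sketched as two graph traversals. This is a simplified illustration: the dictionary-based graph, the function names, and the node names in the usage note are hypothetical, and choosing the actual root cause among the ancestors is left to the analyst or to a policy that the sketch does not model.

```python
from collections import defaultdict, deque


def ancestors(depends_on, node):
    """Phase 1: walk up from a malicious process to everything it depends on."""
    seen, queue = set(), deque([node])
    while queue:
        cur = queue.popleft()
        for parent in depends_on.get(cur, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def descendants(depends_on, root):
    """Phase 2: walk down from the root cause to every potential victim."""
    children = defaultdict(set)
    for child, parents in depends_on.items():
        for parent in parents:
            children[parent].add(child)
    seen, queue = set(), deque([root])
    while queue:
        cur = queue.popleft()
        for child in children.get(cur, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Given a graph such as `{"worm": ["inbound_conn"], "ls": ["worm"], "irc_server": ["worm"]}`, the first phase leads from `worm` to `inbound_conn`, and the second phase returns `worm`, `ls`, and `irc_server` as candidates for recovery.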


To answer the second question, we wrote a benchmark 
that simulates the Apache vulnerability scenario 
described in Section 1. The program first creates 50 
files and then performs the following action in a loop 
50 times. In each loop it writes 8KB to each file from 
start to end. Every nth iteration (4th in our implementation), 
the program forks a helper program that reads 
a byte from each of the 50 files and communicates the 
character to the main program via a pipe. This simulates 
the behaviour of a web server opening a new connection 
on a socket. In the causality-based algorithms, once the 
main process reads from the pipe, it is a causally different 
version as it has read data from a new source, i.e., 
a new process. Hence any writes the main process performs 
after that create a new version of the file. For 
this workload, OC does not copy any data in its version 
files; all the files were just created, so it considers all 
the writes to happen to version 1 of the files. CA and 
GF copy data to the version files on the iterations during 
which the parent spawns a child and reads from 
the pipe. This stores 470.5MB of version data. The 
ALL algorithm copies data on every write and this adds 
up to around 2GB of data. The amount of version data is 
shown in Table 7. 
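The workload parameters above allow a quick consistency check on the reported figures. The sketch assumes a loop counter that runs from 1 to 50 with a fork whenever the counter is a multiple of 4; the variable names are illustrative.

```python
ITERATIONS = 50   # loop count stated in the text
FORK_EVERY = 4    # a helper is forked every 4th iteration

# Iterations on which the helper is forked (assumes a 1-based counter).
fork_iterations = [i for i in range(1, ITERATIONS + 1) if i % FORK_EVERY == 0]
num_causal_events = len(fork_iterations)

# Version data stored by CA and GF, as reported in the text, in MB.
ca_version_data_mb = 470.5
per_event_mb = ca_version_data_mb / num_causal_events
```

Under this assumption there are 12 fork iterations, and 470.5MB spread over them comes to roughly 39.2MB per event, which is in line with the roughly 39.15MB that CA and GF read for each recovery in Table 7.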


We then recover versions at various intervals to get a 
sense for how expensive it is to go back further in time in 
each algorithm. There are 12 causality events in all, 
corresponding to the number of times a child is forked. We 
measure the time taken to recover data to a state before 
the 11th, 7th, and the 3rd event. The 11th corresponds to 
recovery close to the latest version, the 7th corresponds 
to recovery two thirds of the way back, and the 3rd 
corresponds to recovery close to one third of the way back. 
The results of this benchmark are shown in Table 7. With 
OC, there are no intermediate versions, so it cannot 
recover anything useful. CA and GF can both recover to 
a correct version and they both take roughly the same 
amount of time to recover. They also read only the exact 
amount of data to be recovered. ALL, however, takes at least 9 
times longer than CA and GF to perform recovery, 
because it has more data and has to search through a large 
amount of data to rebuild the correct version. Further, 
ALL has many more false positives that it has to filter 
before deciding on the version to recover. CA and GF have 
only one version to choose. Since ALL has been versioning 
continuously, one version of a process has multiple 
children. For example, the 3rd causal event has 41,000 
children from which it has to narrow down to 50 versions. 
CA and GF have only 50 children for that causal 
event. Note that the numbers presented in the table do not 
include the time used to identify the version to recover. 


6.3 Results Summary 


In most cases, CA introduces little overhead relative to 
OC, yet it provides versions in cases where OC fails 
to do so. GF performs comparably in many cases, but 
sometimes imposes high run time overheads. Version- 
on-every-write practically always performs poorly both 
in terms of space and elapsed time. For recovery, how- 
ever, CA and GF can indeed be a big win in terms of both 
the time to recover a particular version and the amount of 
data lost. 


7 Related work 


Several prior research projects have built versioning sys- 
tems. We categorize these systems by the version- 
ing algorithm that they use and discuss each class in 
turn. We begin with the version-on-every-write systems. 
CVFS [24] was designed with security in mind. Each in- 
dividual write or small metadata change (e.g., atime up- 
dates) is versioned. The research focuses on methods to 
store and access old versions efficiently. We adopted the 
CVFS approach of using a journal to store old version 
data. Wayback [3] is a user-level versioning file sys- 
tem built on the FUSE framework. On a write call, 
Wayback logs the data being overwritten to an undo log 
before completing the write. Our version file format is 
similar to that of Wayback, but Wayback versions on ev- 
ery write while we version more selectively. The Re- 
pairable file system (RFS) [29] has functionality closest 
to ours. They record both causal data and save versions. 
They, however, collect causal data and data blocks sep- 
arately, thus preventing them from taking advantage of 
causal information to version more selectively, leading 
to versioning on every write. They also have to reconcile 
the causal and versioning data using timestamps as they 
collect them separately. 

Now we discuss systems that use open-close version- 
ing. Elephant [18] is a versioning file system imple- 
mented in the FreeBSD 2.2.8 kernel. Their research fo- 
cus is on providing users with a range of version reten- 
tion policies. Versionfs [10] is a stackable versioning file 
system. Versionfs allows users to selectively version files 
and is focused on the ability to set space reclamation and 
version storage policies for files. Retention/reclamation 
policies are complementary to our work. 

As we discussed in Section 1, snapshots are another 
approach for versioning where an image of a file system is 
made periodically. Systems with snapshot functionality 
include AFS [13], Plan-9 [16], WAFL [9], Episode [6], Venti [17], 
Ext3COW [15], Thresher [21], and Selective versioning 
secure disk system [27]. Skippy [20] proposes metadata 
indexing schemes that can be used to quickly look up 
previous snapshots of a database. 

Several systems have used causal data to provide var- 
ious functions. The Taser intrusion recovery system [7] 
logs all system calls and their arguments. In the event of 
administrative errors or intrusions, they perform causal 
analysis on the logged data to determine the actions that 
need to be done to recover the system. They explore vari- 
ous algorithms and policies that can be used to determine 
the exact operations to be performed during recovery. 
We can leverage all of these algorithms and policies in 
our work, applying them in an online setting. As future 
work, they plan to integrate their work with versioning 
file systems to reduce the disk space requirements and to 
improve scalability. Our work has continued where Taser 
stopped and has taken a step further by integrating both 
causal and versioning systems. BackTracker [11] logs 
all system calls and in the event of an intrusion, performs 
causality analysis to determine the root cause of an in- 
trusion. Autobash [26] is a configuration debugging tool 
that leverages causal information to limit the amount of 
testing required. 

Chapman et al. [5] explore techniques for causal data 
pruning. Their approach for pruning is to remove dupli- 
cates (which we already perform) and factor out common 
subtrees in causal graphs. Another approach for prun- 
ing causality could be to merge the causal information of 
deleted temporary files into their causal ancestors. Space 
can also be reclaimed by deleting the versioning data of 
temporary files, where temporary files are intermediate 
nodes in a causal graph. 

Finally, a number of versioning algorithms have been 
explored by the object oriented database (OODB) com- 
munity. These algorithms are focused on aspects that 
are particular to OODBs such as “how to propagate version 
changes of sub objects to composite objects?”, “how 
to present a consistent view in the face of updates to 
different objects?” [4], “how to version classes as they 
change” [28], etc. 


8 Conclusions 


Combining versioning and causal relationship data offers 
powerful capabilities above and beyond what each kind 
of system can do in isolation. Causality-based versioning 
ensures that we create meaningful versions of objects, 
facilitating better recovery from data-corrupting activi- 
ties under concurrent workloads. While versioning in- 
troduces overheads between 1% and 25%, adding causal 
collection on top of versioning adds only an additional 
5-6% overhead. The Cycle-Avoidance algorithm, which 
restricts itself to considering only per-object, local 
information during online operation, provides superior 
versioning and recovery, at cost comparable to open-close. 

Providing versioning in the context of PASS opens up 
future research possibilities in the areas of reproducibility 
and archival. PASS did not previously provide the 
ability to reproduce objects on the system, because it 
did not preserve all the necessary data. However, with 
versioning, the necessary data do exist. Versioning also 
produces objects that can easily be archived, and PASS 
provides the provenance to accurately describe those ob- 
jects. 


9 Acknowledgments 


We thank Ethan Miller, our shepherd, and Margo Seltzer 
for repeated careful and thoughtful reviews of our pa- 
per. We thank Erez Zadok, Shankar Pasupathy, Jonathan 
Ledlie, and Uri Braun for their feedback on early drafts 
of the paper. We also thank Uri for validating the CA al- 
gorithm in our user level simulator. We thank the FAST 
reviewers for the valuable feedback they provided. This 
work was partially made possible thanks to NSF grant 
CNS-0614784. 


References 


[1] ALTSCHUL, S. F., GISH, W., MILLER, W., MYERS, E. W., 
AND LIPMAN, D. J. Basic local alignment search tool. Molecu- 
lar Biology 215 (1990), 403-410. 


[2] BRAUN, U., GARFINKEL, S., MUNISWAMY-REDDY, K.-K., 
HOLLAND, D. A., AND SELTZER, M. Issues in automatic prove- 
nance collection. In Proceedings of the 2006 International Prove- 
nance and Annotation Workshop (May 2006). 


[3] BRIAN CORNELL AND PETER DINDA AND FABIN BUSTA- 
MANTE. Wayback: A User-level Versioning File System for 
Linux. In Proceedings of the USENIX 2004 Annual Technical 
Conference, FREENIX Track (2004). 


[4] CELLARY, W., AND JOMIER, G. Consistency of versions in 
objects-oriented databases. In Proceedings of the Sixteenth In- 
ternational Conference on Very Large Databases (1990). 


[5] CHAPMAN, A. P., JAGADISH, H. V., AND RAMANAN, P. Ef- 
ficient provenance storage. In SIGMOD ’08: Proceedings of the 
2008 ACM SIGMOD international conference on Management of 
data (New York, NY, USA, 2008), ACM, pp. 993-1006. 


[6] CHUTANI, S., ANDERSON, O. T., KAZAR, M. L., LEVERETT, 
B. W., MASON, W. A., AND SIDEBOTHAM, R. N. The Episode 
file system. In Proceedings of the USENIX Winter 1992 Technical 
Conference (San Francisco, CA, 1992), pp. 43-60. 


[7] GOEL, A., PO, K., FARHADI, K., LI, Z., AND DE LARA, E.
The Taser intrusion recovery system. In SOSP (2005).


[8] HALCROW, M. A. eCryptfs: An enterprise-class encrypted
filesystem for linux. Ottawa Linux Symposium (2005).


[9] HITZ, D., LAU, J., AND MALCOLM, M. File System Design for
an NFS File Server Appliance. In Proceedings of the USENIX
Winter Technical Conference (January 1994), pp. 235-245.




[10] K. MUNISWAMY-REDDY AND C. P. WRIGHT AND A. HIMMER
AND E. ZADOK. A Versatile and User-Oriented Versioning File
System. In Proceedings of the Third USENIX Conference on File
and Storage Technologies (FAST 2004) (March/April 2004).


[11] KING, S. T., AND CHEN, P. M. Backtracking Intrusions. In
SOSP (Bolton Landing, NY, October 2003).


[12] KING, S. T., MAO, Z. M., LUCCHETTI, D. G., AND CHEN,
P. M. Enriching intrusion alerts through multi-host causality. In
the 12th Annual Network and Distributed System Security
Symposium (2005).


[13] KISTLER, J. J., AND SATYANARAYANAN, M. Disconnected
operation in the Coda file system. In Thirteenth ACM Symposium
on Operating Systems Principles (1991).


[14] MUNISWAMY-REDDY, K.-K., HOLLAND, D. A., BRAUN, U.,
AND SELTZER, M. Provenance-aware storage systems. In
Proceedings of the 2006 USENIX Annual Technical Conference.


[15] PETERSON, Z., AND BURNS, R. Ext3cow: A time-shifting file
system for regulatory compliance. ACM Transactions on Storage
1, 2 (2005), 190-212.


[16] QUINLAN, S. A Cached WORM File System. Software: Practice
and Experience 21, 12 (1991), 1289-1299.


[17] QUINLAN, S., AND DORWARD, S. Venti: a new approach to
archival storage. In Proceedings of First USENIX conference on
File and Storage Technologies (January 2002), pp. 89-101.


[18] SANTRY, D. S., FEELEY, M. J., HUTCHINSON, N. C., VEITCH,
A. C., CARTON, R., AND OFIR, J. Deciding When to Forget
in the Elephant File System. In Proceedings of the 17th ACM
Symposium on Operating Systems Principles (December 1999).


[19] SHAH, S., SOULES, C. A. N., GANGER, G. R., AND NOBLE,
B. D. Using provenance to aid in personal file search. In
Proceedings of the USENIX Annual Technical Conference (2007).


[20] SHAULL, R., SHRIRA, L., AND XU, H. Skippy: a new snapshot
indexing method for time travel in the storage manager. In
Proceedings of the 2008 ACM SIGMOD International Conference on
Management of Data (New York, NY, USA).


[21] SHRIRA, L., AND XU, H. Thresher: An efficient storage manager
for copy-on-write snapshots. In Proceedings of the Usenix
Annual Technical Conference (Boston, MA, May 2006).


[22] SIMMHAN, Y. L., PLALE, B., AND GANNON, D. A survey of
data provenance in e-science. SIGMOD Rec. 34, 3 (2005), 31-36.


[23] SOMAYAJI, A., AND FORREST, S. Automated Response Using
System-Call Delays. In USENIX Security Symposium (2000).


[24] SOULES, C. A. N., GOODSON, G. R., STRUNK, J. D., AND
GANGER, G. R. Metadata Efficiency in Versioning File Systems.
In Proceedings of the 2nd USENIX Conference on File and
Storage Technologies (March 2003), pp. 43-58.


[25] Apache httpd 1.3 vulnerabilities.
http://httpd.apache.org/security/vulnerabilities_13.html.


[26] SU, Y.-Y., ATTARIYAN, M., AND FLINN, J. Autobash: improving
configuration management with operating system causality
analysis. In SOSP ’07: Proceedings of Twenty-First ACM
SIGOPS Symposium on Operating Systems Principles (New
York, NY, USA, 2007), ACM, pp. 237-250.


[27] SUNDARARAMAN, S., SIVATHANU, G., AND ZADOK, E. Selective
versioning in a secure disk system. In Proceedings of the
17th USENIX Security Symposium (July-August 2008).


[28] TALENS, G., OUSSALAH, C., AND COLINAS, M. F. Versions
of simple and composite objects. In VLDB ’93: Proceedings of
the 19th International Conference on Very Large Data Bases (San
Francisco, CA, USA, 1993).


[29] ZHU, N., AND CHIUEH, T.-C. Design, implementation, and
evaluation of repairable file service. In The International
Conference on Dependable Systems and Networks (2003).




Enabling Transactional File Access via Lightweight Kernel Extensions 


Richard P. Spillane, Sachin Gaikwad, Manjunath Chinni, and Erez Zadok 


Stony Brook University 


Abstract 


Transactions offer a powerful data-access method 
used in many databases today through a specialized query
API. User applications, however, use a different file- 
access API (POSIX) which does not offer transactional 
guarantees. Applications using transactions can become 
simpler, smaller, easier to develop and maintain, more 
reliable, and more secure. We explored several techniques
for providing transactional file access with minimal
impact on existing programs. Our first prototype
was a standalone kernel component within the Linux 
kernel, but it complicated the kernel considerably and 
duplicated some of Linux’s existing facilities. Our second
prototype was entirely at user level, and while it was
easier to develop, it suffered from high overheads. In 
this paper we describe our latest prototype and the evo- 
lution that led to it. We implemented a transactional file 
API inside the Linux kernel which integrates easily and 
seamlessly with existing kernel facilities. This design is 
easier to maintain, simpler to integrate into existing OSs, 
and efficient. We evaluated our prototype and other sys- 
tems under a variety of workloads. We demonstrate that 
our prototype’s performance is better than comparable 
systems and comes close to the theoretical lower bound 
for a log-based transaction manager. 


1 Introduction 


In the past, providing a transactional interface to files 
typically required developers to choose from two un- 
desirable options: (1) modify complex file system code 
in the kernel or (2) provide a user-level solution which 
incurs unnecessary overheads. Previous in-kernel de- 
signs either had the luxury of designing around trans- 
actions from the beginning [33] or limited themselves 
to supporting only one primary file system [43]. Previ- 
ous user-level approaches were implemented as libraries 
(e.g., Berkeley DB [39], and Stasis [34]) and did not sup- 
port interaction through the VFS [15] with other non- 
transactional processes. These libraries also introduced 
a redundant page cache and provided no support to non- 
transactional processes. This paper presents the design 
and evaluation of a transactional file interface that re- 
quires modifications to neither existing file systems nor 
applications, yet guarantees atomicity and isolation for
standard file accesses using the kernel’s own page cache.

Transactions require satisfaction of the four ACID 
properties: Atomicity, Consistency, Isolation, and Dura- 
bility. Enforcing these properties appears to require 
many OS changes, including a unified cache man- 
ager [12] and support for logging and recovery. Despite 
the complexity of supporting ACID semantics on file 
operations [30], Microsoft [43] and others [4,44] have 
shown significant interest in transactional file systems. 
Their interest is not surprising: developers are constantly 
reimplementing file cleanup and ad-hoc locking mechanisms
that are unnecessary in a transactional file system.
A transactional file system does not eliminate the
need for locking and recovery, but by exposing an interface
for specifying transactional properties, it allows application
programmers to reuse locking, logging, and recovery
code. Defending against TOCTTOU (time of check
till time of use) security attacks also becomes easier [28, 
29] because sensitive operations are easily isolated from 
an intruder’s operations. Security and quality guaran- 
tees for control files, such as configuration files, are be- 
coming more important. The number of programs run- 
ning on a standard system continues to grow along with 
the cost of administration. In Linux, the CUPS print- 
ing service, the Gnome desktop environment, and other 
services all store their configurations in files that can be- 
come corrupted when multiple writers access them or if 
the system crashes unexpectedly. Despite the existence 
of database interfaces, many programs still use configuration
files for their simplicity and generality, and because a
large collection of existing tools can access these simple 
configuration files. For example, Gnome stores over 400 
control files in a user’s home directory. A transactional 
file interface is useful to all such applications. 
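
To make the cost of the ad-hoc approach concrete, the following sketch shows the kind of cleanup-and-rename code that an application typically carries today in order to update a single configuration file safely. It is an illustration only and is not part of Valor; the function name atomic_write and the temporary-file prefix are assumptions made for this example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the contents of `path` with `data` (bytes) so that a crash
    leaves either the old or the new contents, never a mixture."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new contents to a temporary file in the same directory,
    # so that the final rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the data to disk before renaming
        os.replace(tmp_path, path)  # atomic rename on POSIX file systems
    except BaseException:
        # Hand-written cleanup: remove the partial temporary file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Even this much code covers only one file and does nothing to isolate concurrent writers from one another; an update that spans several files needs further ad-hoc locking on top of it.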

To provide ACID guarantees, a file interface must be 
able to mediate all access to the transactional file system. 
This forces the designer of a transactional file system to 
put a large database-like runtime environment either in 
the kernel or in a kernel-like interceptor, since the ker- 
nel typically services file-system system calls. This en- 
vironment must employ abortable logging and recovery 
mechanisms that are linked into the kernel code. VFS- 
cache rollback is also required to revert an aborted trans- 
action [44], its stale inodes, dentries, and other in-kernel 
data structures. The situation can be simplified drastically
if one abandons the requirement that the backing
store for file operations must be able to interact with 
other transaction-oblivious processes (e.g., grep), and 
by duplicating the functionality of the page cache in 
user space. This concession is often made by transac- 
tional libraries such as Berkeley DB [39] and Stasis [34]: 
they provide a transactional interface only to a single file 
and they do not solve the complex problems of rewind- 
ing the page cache and stale in-memory structures af- 
ter a process aborts. Systems such as QuickSilver [33] 
and TxF [43] address this trade-off between the com- 
pleteness and implementation size by redesigning a spe- 
cific file system around proper support for transactional 
file operations. In this paper we show that such a re- 
design is unnecessary, and that every file system can pro- 
vide a transactional interface without requiring special- 
ized modifications. We describe our system which uses 
a seamless approach to provide transactional semantics 
using a new dynamically loaded kernel module, and only 
minor modifications to existing kernel code. Our tech- 
nique keeps kernel complexity low yet still offers a full- 
fledged transactional file interface without introducing 
unnecessary overheads for non-transactional processes. 


We call our file interface Valor. Valor relies on im- 
proved locking and write ordering semantics that we 
added to the kernel. Through a kernel module, it also 
provides a simple in-kernel logging subsystem opti- 
mized for writing data. Valor’s kernel modifications 
are small and easily separable from other kernel components,
thus introducing negligible kernel complexity.
Processes can use Valor’s logging and locking interfaces 
to provide ACID transactions using seven new system 
calls. Because Valor enforces locking in the kernel, it 
can protect operations that a transactional process per- 
forms from any other process in the system. Valor aborts 
a process’s transaction if the process crashes. Valor supports
large and long-living transactions. This is not possible
for ext3, XFS, or any other journaling file system:
these systems can only abort the entire file system jour- 
nal, and only if there is a hardware I/O error or the entire 
system crashes. These systems’ transactions must al- 
ways remain in RAM until they commit (see Section 2). 


Another advantage of our design is that it is imple- 
mented on top of an unmodified file system. This results 
in negligible overheads for processes not using trans- 
actions: they simply access the underlying file system, 
only using the Valor kernel modifications to acquire nec- 
essary locks. Using tried-and-true file systems also pro- 
vides good performance compared to systems that com- 
pletely replace the file system with a database. Valor 
runs with a statistically indistinguishable overhead on 
top of ext3 under typical loads when providing a trans- 
actional interface to a number of sensitive configuration 
files. Valor is designed from the beginning to run well
without durability. File system semantics accept this
as the default, offering fsync(2) [9] as the accepted
means to block until data is safely written to disk. Valor 
has an analogous function to provide durable commits. 
This makes sense in a file-system setting as most opera- 
tions are easily repeatable. For non-durable transactions, 
Valor’s overhead on top of an idealized mock logging 
implementation is only 35% (see Section 4). 

The rest of this paper is organized as follows. In Sec- 
tion 2 we describe previous experiences with designing 
transactional systems and related work that have led us 
to Valor. We detail Valor’s design in Section 3 and eval- 
uate its performance in Section 4. We conclude and pro- 
pose future work in Section 5. 


2 Background 


The most common approach for transactions on stable 
storage is using a relational database, such as an SQL 
server (e.g., MySQL [22]) or an embedded database li- 
brary (e.g., Berkeley DB [39]); but they have also long 
been a desired programming paradigm for file systems. 
By providing a layer of abstraction for concurrency, er- 
ror handling, and recovery, transactions enable simpler, 
more robust programs. Valor’s design was informed by 
two previous file systems we developed using Berkeley 
DB: KBDBFS and Amino [44]. Next we discuss jour- 
naling file systems’ relationship to our work, and we fol- 
low with discussions on database file systems and APIs. 


2.1 Beyond Journaling File Systems 


Journaling file systems suffer from two drawbacks: (1)
they must store all data modified by a transaction in 
RAM until the transaction commits and (2) their journals 
are not designed to be accessed by user processes [16, 
31,42]. Journaling file systems store only enough in- 
formation to commit a transaction already stored in the 
log (redo-only record). This results in journaling file 
systems being forced to contain all data for all in-flight 
transactions in RAM [6,7,42]. For metadata transac- 
tions, which are finite in size and duration, journaling 
file systems are a convenient optimization. However, we 
wanted to provide user processes with transactions that 
could be megabytes large and run for long periods of 
time. The RAM restriction of a journaling file system is 
too limiting to support versatile file-based transactions. 
Two primary approaches were used to provide file- 
system transactions to user processes. (1) Database file 
systems provide transactions to user processes by mak- 
ing fundamental changes to the design of a standard file 
system to support better logging and rollback of inodes, 
dentries, and cached pages [33,36,43]. (2) Database 
access APIs provide transactions to user processes by 
offering a user library that exposes a transactional page 
file. Processes can store application data in the page file
by using library-specific API routines rather than storing
their data on the file system [34, 39]. Valor represents an 
alternative to the above two approaches. Valor’s design 
was settled after designing KBDBFS and Amino [44]. 
We discuss KBDBFS and Amino in their proper con- 
texts in Sections 2.2 and 2.3, respectively. 


2.2 Database File Systems 


KBDBFS was an in-kernel file system built on a port of 
the Berkeley Database [39] to the Linux kernel. It was 
part of a larger project that explored uses of a relational 
database within the kernel. KBDBFS utilized transactions
to provide file-system-level consistency, but did
not export these same semantics to user-level programs. 
It became clear to us that unlocking the potential value of 
a file system built on a database required exporting these 
transactional semantics to user-level applications. KB- 
DBFS could not easily export these semantics to user- 
level applications, because as a standard kernel file sys- 
tem in Linux it was bound by the VFS to cache various 
objects (e.g., inodes and directory entries), all of which 
ran the risk of being rolled back by the transaction. To 
export transactions to user space, KBDBFS would there- 
fore be required to either bypass the VFS layers that re- 
quire these cached objects, or alternatively track each 
transaction’s modifications to these objects. The first 
approach would require major kernel modifications and 
the second approach would duplicate much of the log- 
ging that BDB was already providing, losing many of 
the benefits provided by the database. 

Our design of KBDBFS was motivated in part by a 
desire to modify the existing Linux kernel as little as 
possible. Another transaction system which modified 
an existing OS was Seltzer’s log-structured file system, 
modified to support transaction processing [37]. Seltzer 
et al.’s simulations of transactions embedded in the file
system showed that file system transactions can perform 
as well as a DBMS in disk-bound configurations [35]. 
They later implemented a transaction processing (TP) 
system in a log-structured file system (LFS), and com- 
pared it to a user-space TP system running over LFS and 
a read-optimized file system [37]. 

Microsoft’s TxF [19,43] and QuickSilver’s [33] 
database file systems leverage the early incorporation of 
transactions support into the OS. TxF exploits the trans- 
action manager which was already present in Windows. 
TxF uses multiple file versions to isolate transactional 
readers from transactional writers. TxF works only with 
NTFS and relies on specific NTFS modifications and 
how NTFS interacts with the Windows kernel. Quick- 
Silver is a distributed OS developed by IBM Research 
that makes use of transactional IPC [33]. QuickSilver 
was designed from the ground up using a microkernel 
architecture and IPC. To fully integrate transactions into
the OS, QuickSilver requires a departure from traditional
APIs and requires each OS component to provide spe- 
cific rollback and commit support. We wanted to al- 
low existing applications and OS components to remain 
largely unmodified, and yet allow them to be augmented 
with simple begin, commit, and abort calls for file sys- 
tem operations. We wanted to provide transactions with- 
out requiring fundamental changes to the OS, and with- 
out restricting support to a particular file system, so that 
applications can use the file system most suited to their 
work load on any standard OS. Lastly, we did not want 
to incur any overheads on non-transactional processes. 

Inversion File System [24], OdeFS [5], iFS [26], and 
DBFS [21] are database file systems implemented as 
user-level NFS servers [17]. As they are NFS servers 
(which predate NFSv4’s locking and callback capabil- 
ities [38]), the NFS client’s cache can serve requests 
without consulting the NFS server’s database; this could 
allow a client application to write to a portion of the file 
system that has since been locked by another applica- 
tion, violating the client application’s isolation. They do 
not address the problem of supporting efficient transac- 
tions on the local disk. 


2.3 Database Access APIs 


The other common approach to providing a transactional 
interface to applications is to provide a user-level li- 
brary to store data in a special page file or B-Tree main- 
tained by the library. Berkeley DB offers a B-Tree, a 
hash table, and other structures [39]. Stasis offers a 
page file [34]. These systems require applications to use 
database-specific APIs to access or store data in these 
library-controlled page files. 

Based on our experiences with KBDBFS, we chose 
to prototype a transactional file system, again built on 
BDB, but in user space. Our prototype, Amino, utilized 
Linux’s process debugging interface, ptrace [8], to service
file-system-related calls on behalf of other processes,
storing all data in an efficient Berkeley DB B-tree
schema. Through Amino we demonstrated two main 
ideas. First, we revealed the ability to provide trans- 
actional semantics to user-level applications. Second, 
we showed the benefits that user-level programs gain 
when they use these transactional semantics: program- 
ming model simplification and application-level con- 
sistency [44]. Although we extended ptrace to re- 
duce context switches and data copies, Amino’s per- 
formance was still poor compared to an in-kernel file 
system for some system-call-intensive workloads (such
as the configuration phase of a compile). Finally, al- 
though Amino’s performance was comparable to Ext3 
for metadata workloads (such as Postmark [14]), for 
data-intensive workloads, Amino’s database layout resulted
in significantly lower throughput. Amino was a
successful project in that it validated the concept of a
transactional file system with a user-visible transactional 
API, but the performance we achieved could not displace 
traditional file systems. Moreover, one of our primary 
goals is for transactional and non-transactional programs 
to have access to the same data through the file system 
interface. Although Amino provided binary compatibil- 
ity with existing applications, running programs through 
a ptrace monitor was not as seamless as we would have liked. The
ptrace monitor had to run in privileged mode to service 
all processes, it serviced system calls inefficiently due 
to additional memory copies and context switches, and 
it imposed additional overhead from using signal pass- 
ing to simulate a kernel system call interface for appli- 
cations [44]. Other user level approaches to providing 
transactional interfaces include Berkeley DB and Stasis. 


Berkeley DB. Berkeley DB is a user library that pro- 
vides applications with an API to transactionally update 
key-value pairs in an on-disk B-Tree. We discuss Berke- 
ley DB’s relative performance in depth in Section 4. We 
benchmark BDB through Valor’s file system extensions. 
Relying on BDB to perform file system operations can 
result in large overheads for large serial writes or large 
transactions (256MiB or more). This is because BDB is 
being used to provide a file interface, which is used by 
applications with different work-loads than applications 
that typically use a database. If the regular BDB in- 
terface is used, though, transaction-oblivious processes 
cannot interact with transactional applications, as the 
former use the file system interface directly.


Stasis. Stasis provides applications a transactional in- 
terface to a page file. Stasis requires that applications 
specify their own hooks to be used by the database to 
determine efficient undo and redo operations. Stasis sup- 
ports nested transactions [7] alongside write-ahead log- 
ging and LSN-Free pages [34] to improve performance. 
Stasis does not require applications to use a B-Tree on 
disk and exposes the page file directly. Like BDB, Sta- 
sis requires applications to be coded against its API to 
read and write transactionally. Like BDB, Stasis does 
not provide a transactional interface on top of an exist- 
ing file system which already contains data. Also like 
BDB, Stasis implements its own private, yet redundant 
page cache which is less efficient than cooperating with 
the kernel’s page cache (see Section 4). 

Reflecting on our experience with KBDBFS and 
Amino, we have come to the conclusion that adapting 
the file system interface to support ACID transactions 
does indeed have value and that the two most valu- 
able properties that the database provided to us were 
the logging and the locking infrastructure. Therefore, 
in Valor we provide two key kernel facilities: (1) extended
mandatory locking and (2) simple write ordering.
Extended mandatory locking lets Valor provide the
isolation that in our previous prototypes was provided 
by the database’s locking facility. Simple write ordering 
lets Valor’s logging facility use the kernel’s page cache 
to buffer dirty pages and log pages, which reduces redundancy,
improves performance, and makes it easier to
support transactions on top of existing file systems. 


3 Design and Implementation 


The design of Valor prioritizes (1) a low-complexity kernel
design, (2) a versatile interface that makes use of
transactions optional, and (3) performance. Our seam- 
less approach achieves low complexity by exporting just 
a minimal set of system calls to user processes. Func- 
tionality exposed by these system calls would be difficult 
to implement efficiently in user-space. 

Valor allows applications to perform file-system op- 
erations within isolated and atomic transactions. Iso- 
lation guarantees that file-system operations performed 
within one transaction have no impact on other pro- 
cesses. Atomicity guarantees that committing a trans- 
action causes all operations performed in it to be per- 
formed at once, as a unit inseparable even by a sys- 
tem crash. If desired, Valor can ensure a transaction 
is durable: if the transaction completes, the results are 
guaranteed to be safe on disk. We now turn to Valor’s 
transactional model, which specifies the scope of these 
guarantees and what processes must do to ensure they 
are provided. 


Transactional Model. Valor’s transactional guaran- 
tees extend to the individual inodes and pages of di- 
rectories and regular files for reads and writes. A pro- 
cess must lock an entire file if it will read from or write 
to its inode. Appends and truncations modify the file 
size, so they also must lock the entire file. To overwrite 
data in a file, only the affected pages need to be locked. 
When performing directory operations like file creation 
and unlinking, only the containing directory needs to be 
locked. When renaming a directory, processes must also 
recursively lock all of the directory’s descendants. This 
is the accepted way to handle concurrent lockers dur- 
ing a directory rename [27]. More sophisticated lock- 
ing schemes (e.g., intent locks [3]) that improve per- 
formance and relieve contention among concurrent pro- 
cesses are beyond the scope of this paper. 
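
The locking rules of the transactional model can be restated as a small lookup. The sketch below is only a reading aid: the operation names, the function required_locks, and the scope labels are assumptions made for this example, and the caller supplies a directory's descendants rather than the code walking a real file system. The text does not say whether a renamed directory itself or the destination's parent is locked, so neither is modeled here.

```python
import posixpath

# Hypothetical operation names used only for this illustration.
WHOLE_FILE_OPS = {"read_inode", "write_inode", "append", "truncate"}
DIRECTORY_OPS = {"create", "unlink"}


def required_locks(op, path, pages=(), descendants=()):
    """Return the set of (target, scope) locks one operation must hold."""
    if op in WHOLE_FILE_OPS:
        # Inode access, appends, and truncations lock the entire file.
        return {(path, "whole-file")}
    if op == "overwrite":
        # Overwriting data locks only the affected pages.
        return {(path, "page %d" % p) for p in pages}
    if op in DIRECTORY_OPS:
        # File creation and unlinking lock only the containing directory.
        return {(posixpath.dirname(path), "directory")}
    if op == "rename_dir":
        # Renaming a directory also locks all of its descendants.
        locks = {(posixpath.dirname(path), "directory")}
        locks.update((d, "descendant") for d in descendants)
        return locks
    raise ValueError("unknown operation: %r" % (op,))
```

For example, required_locks("overwrite", "/etc/a.conf", pages=[0, 2]) yields two page-level locks, whereas an append to the same file yields a single whole-file lock.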

We now turn to the concepts underlying Valor’s archi- 
tecture. These concepts are implemented as components 
of Valor’s system; they are illustrated in Figure 1. 


Figure 1: Valor Architecture


1. Logging Device. In order to guarantee that a sequence
of modifications to the file system completes as
a unit, Valor must be able to undo partial changes left
behind by a transaction that was interrupted by either a
system crash or a process crash. This means that Valor
must store some amount of auxiliary data, because an
unmodified file system can only be relied upon to atomi- 
cally update a single sector and does not provide a mech- 
anism for determining the state before an incomplete 
write. Common mechanisms for storing this auxiliary 
data include a log [7] and WAFL [13]. Valor does not 
modify the existing file system, so it uses a log stored on 
a separate partition called the log partition. 
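
To illustrate why the auxiliary data suffices to undo an interrupted transaction, consider the following toy recovery pass. It is a sketch under simplifying assumptions and not Valor's recovery code: the disk is a dictionary of pages, the log is a list of tuples whose layout is invented for this example, and only the undo half of a record is modeled.

```python
def recover(disk, log):
    """Undo the changes of every transaction that never committed.

    disk: dict mapping page id -> contents
    log:  list of records in the order they were written:
          ("begin", txn), ("undo", txn, page, old_contents), ("commit", txn)
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Walk the log backwards so that the most recent change is undone first.
    for rec in reversed(log):
        if rec[0] == "undo" and rec[1] not in committed:
            _, _txn, page, old = rec
            if old is None:
                disk.pop(page, None)  # the page did not exist beforehand
            else:
                disk[page] = old
    return disk
```

A page written twice by an unfinished transaction is restored to its contents from before the first write, while pages written by committed transactions are left alone.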


2. Simple Write Ordering. Valor relies on the fact 
that even if a write to the file system fails to complete, 
the auxiliary information has already been written to the 
log. Valor can use that information to undo the partial 
write. In short, Valor needs to have a way to ensure that 
writes to the log partition occur before writes to other 
file systems. This requirement is a special case of write 
ordering, in which the page cache can control the order 
in which its writes reach the disk. We discuss our im- 
plementation in Section 3.1, which we call simple write 
ordering both because it is a special case and because it 
operates specifically at page granularity. 
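
The ordering constraint can be pictured with a toy page cache that, at write-back time, always sends dirty pages of the log partition to disk before dirty pages of any other partition. This is a minimal sketch of the constraint only; the class name OrderedPageCache and its methods are assumptions made for this example and do not correspond to kernel interfaces.

```python
class OrderedPageCache:
    """Toy page cache that writes log-partition pages before all others."""

    def __init__(self):
        self.dirty = []        # (partition, page) in the order pages were dirtied
        self.disk_writes = []  # order in which pages reached the "disk"

    def mark_dirty(self, partition, page):
        if (partition, page) not in self.dirty:
            self.dirty.append((partition, page))

    def writeback(self):
        # Log pages go first, so every undo record is on disk before the
        # data it protects can be overwritten in place.
        log_pages = [d for d in self.dirty if d[0] == "log"]
        other_pages = [d for d in self.dirty if d[0] != "log"]
        for entry in log_pages + other_pages:
            self.disk_writes.append(entry)
        self.dirty.clear()
```

Within each group the pages keep the order in which they were dirtied; only the relative order of the two groups is constrained.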


3. Extended Mandatory Locking. Isolation gives a 
process the illusion that there are no other concurrently 
executing processes accessing the same files, directories, 
or inodes. Transactional processes can implement this 
by first acquiring a lock before reading or writing to a 
page in a file, a file’s inode, or a directory. However, 
an OS with a POSIX interface and pre-existing appli- 
cations must support processes that do not use transac- 
tions. These transaction-oblivious processes do not ac- 
quire locks before reading from or writing to files or 
directories. Extended mandatory locking ensures that
all processes acquire locks before accessing these resources.
See Section 3.2.
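
The following sketch shows the idea in miniature: a lock table with shared and exclusive modes, and an access path for transaction-oblivious processes that takes and releases the lock on their behalf. It is an illustration under stated assumptions rather than the kernel implementation; the names LockTable and oblivious_write are invented here, and where a kernel would block the caller the sketch simply reports failure.

```python
class LockTable:
    """Toy lock table with shared and exclusive modes per resource."""

    def __init__(self):
        self.holders = {}  # resource -> {pid: "shared" | "exclusive"}

    def try_lock(self, pid, resource, mode):
        held = self.holders.setdefault(resource, {})
        others = [m for p, m in held.items() if p != pid]
        if mode == "exclusive" and others:
            return False
        if mode == "shared" and "exclusive" in others:
            return False
        if held.get(pid) != "exclusive":  # never downgrade an existing lock
            held[pid] = mode
        return True

    def unlock(self, pid, resource):
        self.holders.get(resource, {}).pop(pid, None)


def oblivious_write(table, pid, resource, do_write):
    """Write path for a process that never requests locks itself."""
    if not table.try_lock(pid, resource, "exclusive"):
        return False  # a real kernel would block the caller here instead
    try:
        do_write()
    finally:
        table.unlock(pid, resource)
    return True
```

While a transactional process holds an exclusive lock on a file, an oblivious writer's attempt is refused; once the lock is released, the same write succeeds.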


4. Interception Mechanism. New applications can 
use special APIs to access the transaction functionality 
that Valor provides; however, pre-existing applications 
must be made to run correctly if they are executed in- 
side a transaction. This could occur if, for example, a 
Valor-aware application starts a transaction and launches 
a standard shell utility. To do this, Valor modifies the 
standard POSIX system calls used by unmodified appli- 
cations to perform the locking necessary for proper iso- 
lation. Section 3.3 describes our modifications. 

The above four Valor components provide the neces- 
sary infrastructure for the seven Valor system calls. Pro- 
cesses that desire transactional semantics must use the 
Valor system calls to log their writes and acquire locks 
on files. We now discuss the Valor system calls and then 
provide a short example to illustrate Valor’s basic oper- 
ation. 


Valor’s Seven System Calls. When an application 
uses the following seven system calls correctly (e.g., 
calling the appropriate system call before writing to a 
page), Valor provides that application fully transactional 
semantics. This is true even if other user-level applica- 
tions do not use these system calls or use them incor- 
rectly. 


Log Begin begins a transaction. This must be called 
before all other operations within the transaction. 

Log Append logs an undo-redo record, which stores 
the information allowing a subsequent operation to 
be reversed. This must be called before every oper- 
ation within the transaction. See Section 3.1. 

Log Resolve ends a transaction. In case of an error, a 
process may voluntarily abort a transaction, which 
undoes partial changes made during that transac- 
tion. This operation is called an abort. Conversely, 
if a process wants to end the transaction and en- 
sure that changes made during a transaction are all 
done as an atomic unit, it can commit the transac- 
tion. Whether a log resolve is a commit or an 
abort depends on a flag that is passed in. 

Transaction Sync flushes a transaction to disk. A 
process may call Transaction Sync to ensure 
that changes made in its committed transactions 
are on disk and will never be undone. This is the 
only sanctioned way to achieve durability in Valor. 
O_DIRECT, O_SYNC, and fsync [9] have no useful 
effect within a transaction for the same reason that 
nested transactions cannot be durable: the parent 
transaction has yet to commit [7]. 

Lock, Lock Permit, Lock Policy Our Lock sys- 
tem call locks a page range in a file, an entire di- 
rectory, or an entire file with a shared or exclusive 
lock. This is implemented as a modified fcntl. 
These routines provide Valor’s support for transac- 
tional isolation. Lock Permit and Lock Policy 
are required for security and inter-process transac- 
tions, respectively. See Section 3.2. 
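
To make the idea of a page-range lock concrete, the following minimal sketch takes a shared or exclusive lock on a range of pages using the standard POSIX advisory locking interface as exposed by Python's fcntl module. It is not Valor's modified fcntl, and it does not model Lock Permit or Lock Policy; the page size and the helper names lock_page_range and unlock_page_range are assumptions made for illustration.

```python
import fcntl
import os
import tempfile

PAGE_SIZE = 4096  # assumed page size, for illustration only


def lock_page_range(fd, first_page, num_pages, exclusive):
    """Take an advisory POSIX lock covering whole pages of an open file."""
    kind = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    # lockf(fd, cmd, len, start, whence): lock num_pages pages from first_page.
    fcntl.lockf(fd, kind, num_pages * PAGE_SIZE,
                first_page * PAGE_SIZE, os.SEEK_SET)


def unlock_page_range(fd, first_page, num_pages):
    """Release the lock taken by lock_page_range on the same range."""
    fcntl.lockf(fd, fcntl.LOCK_UN, num_pages * PAGE_SIZE,
                first_page * PAGE_SIZE, os.SEEK_SET)


def demo():
    # A scratch file of four pages; lock pages 1 and 2 exclusively.
    with tempfile.TemporaryFile() as f:
        f.truncate(4 * PAGE_SIZE)
        lock_page_range(f.fileno(), 1, 2, exclusive=True)
        unlock_page_range(f.fileno(), 1, 2)
        return True
```

Standard locks of this kind are advisory: a process that never asks for the lock is not stopped from accessing the pages, which is the gap that extended mandatory locking is meant to close.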


Cooperating with the Kernel Page Cache. As illus- 
trated in Figure 1, the kernel’s page cache is central to 
Valor, and one of Valor’s key contributions is its close 
cooperation with the page cache. In systems that do 
not support transactions, the write(2) system call ini- 
tiates an asynchronous write which is later flushed to 
disk by the kernel page cache’s dirty-page write-back 
thread. In Linux, this thread is called pdflush [1]. 
If an application requires durability in this scenario, 
it must explicitly call fsync(2). Omitting durability 
by default is an important optimization which allows 
pdflush to economize on disk seeks by grouping writes 
together. Databases, despite introducing transaction se- 
mantics, achieve similar economies through No-Force 
page caches. These caches write auxiliary log records 
only when a transaction commits, and then only as one 
large serial write, and use threads similar to pdflush 
to flush data pages asynchronously [7]. Valor is also 
No-Force, but can further reduce the cost of commit- 
ting a transaction by writing nothing—neither log pages 
nor data pages—until pdflush activates. Valor’s sim- 
ple write ordering scheme facilitates this optimization by 
guaranteeing that writes to the log partition always oc- 
cur before the corresponding data writes. In the absence 
of simple write ordering, Valor would be forced to im- 
plement a redundant page cache, as many other systems 
do. Valor implements simple write ordering in terms of 
existing Linux fsync semantics, which return when the 
writes are scheduled, but before they hit the disk plat- 
ter. This introduces a short race where applications run- 
ning on top of Valor and the other systems we evaluated 
(Berkeley DB, Stasis, and ext3) could crash unrecover- 
ably. Unfortunately, this is the standard fsync imple- 
mentation and impacts other systems such as MySQL, 
Berkeley DB, and Stasis [45] which rely on fsync or its 
like (i.e., fdatasync, O_SYNC, and direct-IO). 
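
The non-transactional baseline described above, in which a write lands in the page cache and durability requires an explicit flush, can be sketched with the standard os.write and os.fsync calls. The helper name durable_write and the file name are illustrative assumptions; nothing here is specific to Valor.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data, then explicitly ask the kernel to flush the file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)  # lands in the page cache; write-back is asynchronous
        os.fsync(fd)        # explicit request to flush this file's dirty pages
    finally:
        os.close(fd)


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "example.dat")
        durable_write(path, b"hello")
        with open(path, "rb") as f:
            return f.read()
```

Without the os.fsync call, the data would still become visible to readers immediately, but it would reach the disk only when the write-back thread chose to flush it.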


One complexity introduced by this scheme is that a 
transaction may be completely written to the log, and re- 
ported as durable and complete, but its data pages may 
not yet all be written to disk. If the system crashes in 
this scenario, Valor must be able to complete the disk 
writes during recovery to fulfill its durability guarantee. 
Similar to database systems that also perform this opti- 
mization, Valor includes sufficient information in the log 
entries to redo the writes, allowing the transaction to be 
completed during recovery. 


Another complexity is that Valor supports large trans- 
actions that may not fit entirely in memory. This means 
that some memory pages that were dirtied during an in- 
complete transaction may be flushed to disk to relieve 
memory pressure. If the system crashes in this scenario, 
Valor must be able to roll back these flushes during re- 
covery to fulfill its atomicity guarantee. Valor writes 
undo records describing the original state of each af- 
fected page to the log when flushing in this way. A page 
cache that supports flushing dirty pages from uncommit- 
ted transactions is known as a Steal cache; XFS [41], 
ZFS [40], and other journaling file systems are No-Steal, 
which limits their transaction size [42] (see Section 2). 
Valor’s solution is a variant of the ARIES transaction re- 
covery algorithm [20]. 


An Example. Figure 2 illustrates Valor’s writeback 
mechanism. A process P initially calls the Lock sys- 
tem call to acquire access to two data pages in a file, 
then calls the Log Append system call on them, gener- 
ating the two ’L’s in the figure, and then calls write(2) 
to update the data contained in the pages, generating the 
two ’P’s in the figure. Finally, it commits the transac- 
tion and quits. The process did not call Transaction 
Sync. On the left hand side, the figure shows the state of 
the system before P commits the transaction; because 
of Valor’s non-durable No-Force logging scheme, data 
pages and corresponding undo/redo log entries both re- 
side in the page cache. On the right hand side, the pro- 
cess has committed and exited; simple write ordering 
ensures that the log entries are safely resident on disk, 
and the data pages will be written out by pdflush as 
needed. 
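
The sequence in this example can be mimicked with a toy in-memory model. The sketch below is not Valor's kernel code: the class ToyWriteback and its methods are hypothetical names, and a single pdflush call stands in for the whole write-back pass. It shows only that nothing reaches the disk at commit time and that log records ('L') are written before data pages ('P').

```python
class ToyWriteback:
    """Toy model of the write-back example; not Valor's kernel code."""

    def __init__(self):
        self.cached_log = []    # log records still in the page cache
        self.cached_pages = {}  # dirty data pages still in the page cache
        self.disk = []          # order in which items reach the disk

    def log_append(self, page_no):
        self.cached_log.append(page_no)

    def write(self, page_no, data):
        if page_no not in self.cached_log:
            raise RuntimeError("Log Append must precede the write")
        self.cached_pages[page_no] = data

    def pdflush(self):
        # Simple write ordering: all log records go out before any data page.
        self.disk.extend(("L", p) for p in self.cached_log)
        self.cached_log.clear()
        self.disk.extend(("P", p) for p in sorted(self.cached_pages))
        self.cached_pages.clear()


def run_example():
    v = ToyWriteback()
    for page_no in (0, 1):
        v.log_append(page_no)
        v.write(page_no, b"new contents")
    # The process commits without Transaction Sync: nothing is on disk yet.
    before_flush = list(v.disk)
    v.pdflush()
    return before_flush, v.disk
```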

We now discuss each of Valor’s four architectural 
components in detail. Section 3.1 discusses the log- 
ging, simple write ordering, and recovery components of 
Valor. Section 3.2 discusses Valor’s extended mandatory 
locking mechanism, and Section 3.3 explains Valor’s in- 
terception mechanism. 


3.1 The Logging Interface 


Valor maintains two logs. A general-purpose log 
records information on directory operations, like adding 
and removing entries from a directory, and inode op- 
erations, like appends or truncations. A page-value 
log records modifications to individual pages in regular 
files [2]. Before writing to a page in a regular file (dirty- 
ing the page), and before adding or removing a name 
from a directory, the process must call Log Append to 
prepare the associated undo-redo record. We refer to 
this undo-redo record as a log record. Since the bulk 
of file system I/O is from dirtying pages and not direc- 
tory operations, we have only implemented Valor’s page 
log for evaluation. Valor manages its logs by keeping 
track of the state of each transaction, and tracking which 
log records belong to which transactions. 


[Figure 3 diagram labels: General Purpose Log; Page Value Log; Page Cache; Inflight, Landed, Freeing Lists; 1 Page (transition)] 





Figure 3: Valor Log Layout 


3.1.1 In-Memory Data Structures 


There are three states a transaction can be in during the 
course of its life: (1) in-flight, in which the applica- 
tion has called Log Begin but has not yet called Log 
Resolve; (2) landed, in which the application has called 
Log Resolve but the transaction is not yet safe to deal- 
locate; and (3) freeing, in which the transaction is ready 
to be deallocated. Landed is distinct from freeing be- 
cause if an application does not require durability, Log 
Resolve causes neither the log nor the data from the 
transaction to be flushed to disk (see above, Cooperat- 
ing with the Kernel Page Cache). 

Valor tracks a transaction by allocating a commit set 
for that transaction. A commit set consists of a unique 
transaction ID and a list of log records. As depicted 
in Figure 3, Valor maintains separate lists of in-flight, 
landed, and freeing commit sets. It also uses a radix tree 
to track free on-disk log records. 


USENIX Association 


Life Cycle of a Transaction. When a process calls 
Log Begin, it gets a transaction ID by allocating a new 
log record, called a commit record. Valor then creates an 
in-memory commit set and moves it onto the inflight list. 
During the lifetime of the transaction, whenever the pro- 
cess calls Log Append, Valor adds new log records to 
the commit set. When the process calls Log Resolve, 
Valor moves its commit set to the landed list and marks 
it as committed or aborted depending on the flag passed 
in by the process. If the transaction is committed, Valor 
writes a magic value to the commit record allocated dur- 
ing Log Begin. If the system crashes and the log is 
complete, the value of this log record dictates whether 
the transaction should be recovered or aborted. 
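
The life cycle above amounts to a small state machine over three lists. The following sketch models it in memory only; the class names, the writeback_done hook, and the value of COMMIT_MAGIC are assumptions made for illustration, not Valor's actual structures.

```python
import itertools

COMMIT_MAGIC = 0xC0FFEE  # hypothetical marker value, not Valor's


class CommitSet:
    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.records = []          # log records added by Log Append
        self.outcome = None        # "committed" or "aborted" once resolved
        self.commit_record = None  # holds the magic value after a commit


class TxnTracker:
    def __init__(self):
        self._ids = itertools.count(1)
        self.inflight, self.landed, self.freeing = {}, {}, {}

    def log_begin(self):
        cs = CommitSet(next(self._ids))
        self.inflight[cs.txn_id] = cs
        return cs.txn_id

    def log_append(self, txn_id, record):
        self.inflight[txn_id].records.append(record)

    def log_resolve(self, txn_id, commit):
        cs = self.inflight.pop(txn_id)
        cs.outcome = "committed" if commit else "aborted"
        if commit:
            cs.commit_record = COMMIT_MAGIC
        self.landed[txn_id] = cs

    def writeback_done(self, txn_id):
        # Called once the set's changes are on disk: landed -> freeing.
        self.freeing[txn_id] = self.landed.pop(txn_id)

    def free_all(self):
        freed = sorted(self.freeing)
        self.freeing.clear()
        return freed
```

A transaction that has been resolved but not yet written back stays on the landed list, which is why landed and freeing are kept distinct.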


One thing Valor must be careful about is the case in 
which a log record is flushed to disk by pdflush, its 
corresponding file page is updated with a new value, and 
the file system containing that file page writes it to disk, 
thus violating write ordering. To resolve this issue, Valor 
keeps a flag in each page in the kernel’s page cache. This 
flag can read available or unavailable; between the time 
Valor flushes the page’s log record to the log and the 
time the file system writes the dirty page back to disk, 
it is marked as unavailable, and processes which try to 
call Log Append to add new log records wait until it 
becomes available, thus preserving our simple write or- 
dering constraint. For hot file-system pages (e.g., those 
containing global counters), this could result in bursty 
write behavior. One possible remedy is to borrow Ext3’s 
solution: when writing to an unavailable page, Valor can 
create a copy. The original copy remains read-only and 
is freed after the flush completes. The new copy is used 
for new reads and writes and is not flushed until the next 
pdflush, maintaining the simple write ordering. 


We modified pdflush to maintain Valor’s in-memory 
data structures and to obey simple write ordering by 
flushing the log’s super block before all other super 
blocks. When pdflush runs, it (1) moves commit sets 
which have been written back to disk to the freeing list, 
(2) marks all page log records in the inflight and landed 
lists as unavailable, (3) atomically transitions the disk 
state to commit landed transactions to disk, and (4) it- 
erates through the freeing list to deallocate transactions 
which have been safely written back to disk. 


Soft vs. Hard Deallocations. Valor deallocates log 
records in two situations: (1) when a Log Append fails 
to allocate a new log record, and (2) when pdflush 
runs. Soft deallocation waits for pdflush to naturally 
write back pages and moves a commit set to the freeing 
list to be deallocated once all of its log records have had 
their changes written back. Hard deallocation explicitly 
flushes a landed commit set’s dirty pages and directory 
modifications so it can immediately deallocate it. 


7th USENIX Conference on File and Storage Technologies 35 




3.1.2 On-Disk Data Structures 


Figure 3 shows the page-value log and general-purpose 
log. Valor maintains two record map files to act as su- 
perblocks for the log files, and to store which log records 
belong to which transactions. One of these record map 
files corresponds to the general-purpose log, and the 
other to the page-value log. For a given log, there are 
exactly the same number of entries in the record map 
as there are log records in the log. The five fields of a 
record map entry are: 


Transaction ID The transaction (commit set) this log 
record belongs to. 

Log Sequence Number (LSN) Indicates when this log 
record was allocated. 

inode Inode of the file whose page was modified. 

netid Serial number of the device the inode resides on. 

offset Offset of the page that was modified. 


General-purpose log records contain directory path 
names for recovery of original directory listings in case 
of a crash. Page value log records contain a specially- 
encoded page to store both the undo and the redo record. 
The state file is part of the mechanism employed by 
Valor to ensure atomicity. It is described in Section 3.1.3 
along with Valor’s atomic flushing procedure. 


Transition Value Logging. Although the undo-redo 
record of an update to a page could be stored as the 
value of the page before the update and the value af- 
ter, Valor instead makes a reasonable optimization in 
which it stores only the XOR of the value of the page 
before and after the update. This is called a transition 
page. Transition pages can be applied to either recover 
or abort the on-disk image. A pitfall of this technique 
is that idempotency is lost [7]; Valor avoids this prob- 
lem by recording the location and value of the first bit 
of each sector in the log record that differed between 
the undo and redo image. Although log records are al- 
ways page-sized, this information must be stored on a 
per-sector basis as the disk may only write part of the 
page. (Because meta-data is stored in a separate map, 
transition pages in the log are all sector-aligned.) If a 
transaction updates the same page multiple times, Valor 
forces each Log Append call to wait on the Page Avail- 
able flag which is set by the simple write ordering com- 
ponent operating within pdflush. If it does not have to 
wait, the call may update the log record’s page directly, 
incurring no I/O. However, if the call must wait, then a 
new log record must be made to ensure recoverability. 
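
The XOR encoding and the idempotency problem it creates can be illustrated on a single sector. In this sketch the marker is taken to be the lowest-order differing bit of the first differing byte; that bit ordering, the tuple layout of the marker, and all function names are assumptions made for illustration.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def make_transition(before, after):
    """Return (transition, marker) for one sector.

    marker is (byte index, bit mask, bit value in the redo image) for the
    first differing bit, or None if the sector did not change.
    """
    transition = xor_bytes(before, after)
    for i, t in enumerate(transition):
        if t:
            mask = t & -t  # lowest set bit of the first nonzero byte
            return transition, (i, mask, bool(after[i] & mask))
    return transition, None


def is_redo_image(sector, marker):
    i, mask, redo_bit = marker
    return bool(sector[i] & mask) == redo_bit


def roll_forward(sector, transition, marker):
    # Apply the transition only if the sector still holds the undo image.
    if marker is None or is_redo_image(sector, marker):
        return sector
    return xor_bytes(sector, transition)


def roll_back(sector, transition, marker):
    # Apply the transition only if the sector already holds the redo image.
    if marker is None or not is_redo_image(sector, marker):
        return sector
    return xor_bytes(sector, transition)
```

Applying the raw XOR twice returns the sector to its starting value, which is exactly the lost idempotency; checking the marker bit first makes both roll_forward and roll_back safe to repeat.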


3.1.3 LDST: Log Device State Transition 


Valor’s in-memory data structures are a reflection of 
Valor’s on-disk state; however, as commit sets and log 
records are added, Valor’s on-disk state becomes stale 
until the next time pdflush runs. We ensure that 
pdflush performs an atomic transition of Valor’s on- 
disk state to reflect the current in-memory state, thus 
making it no longer stale. To represent the previous and 
next state of Valor’s on-disk files, we have a stable and 
unstable record map for each log file. The stable record 
maps serve as an authoritative source for recovery in the 
event of a crash. The unstable record maps are updated 
during Valor’s normal operation, but are subject to cor- 
ruption in the event of crashes. The purpose of Valor’s 
LDST is to make the unstable record map consistent, and 
then safely and atomically relabel the stable record maps 
as unstable and vice versa. This is similar to the scheme 
employed by LFS [32, 37]. 

The core atomic operation of the LDST is a pointer 
update, in which Valor updates the state file. This file is a 
pointer to the pair of record maps that is currently stable. 
Because it is sector-aligned and less than one sector in 
size, a write to it is atomic. All other steps ensure that 
the record maps are accurate at the point in time where 
the pointer is updated. The steps are as follows: 


1. Quiesce (block) all readers and writers to any on- 
disk file in the Valor partition. 

2. Flush the inodes of the page-value and general- 
purpose log files. This flushes all new log records 
to disk. Log records can only have been added, so a 
crash at this point has no effect as the stable record 
map does not point to any of the new entries. 

3. Flush the inodes of the unstable page-value and 
general-purpose record map files. 

4. Write the names of the newly stable record maps to 
the state file. 

5. Flush the inode of the state file. The up-to-date 
record map is now stable, and Valor now recovers 
from it in case of a system crash. 

6. Copy the contents of the stable (previously unsta- 
ble) record map over the contents of the unstable 
(previously stable) record map, bringing it up to 
date. 

7. Un-quiesce (unblock) readers and writers. 

8. Free all freeing log records. 
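
The steps above can be modeled in memory as two copies of a record map plus a one-word pointer. The sketch below is only a model under stated assumptions: the class ToyLDST, the map names "A" and "B", and the crash flag are hypothetical, and no real flushing or quiescing takes place.

```python
import copy


class ToyLDST:
    def __init__(self):
        # Two copies of the record map; the state "file" names the stable one.
        self.record_maps = {"A": {}, "B": {}}
        self.state_file = "A"  # stands in for the atomic on-disk pointer
        self.pending = {}      # in-memory entries not yet on disk

    def unstable_name(self):
        return "B" if self.state_file == "A" else "A"

    def add_record(self, slot, txn_id):
        self.pending[slot] = txn_id

    def transition(self, crash_before_pointer_update=False):
        unstable = self.unstable_name()
        # Steps 2-3: bring the unstable map up to date and flush it.
        self.record_maps[unstable].update(self.pending)
        if crash_before_pointer_update:
            return  # stable map untouched; recovery ignores the new entries
        # Steps 4-5: one small write flips which map is stable.
        old_stable, self.state_file = self.state_file, unstable
        # Step 6: copy the newly stable map over the old one.
        self.record_maps[old_stable] = copy.deepcopy(self.record_maps[unstable])
        self.pending.clear()

    def recover(self):
        # Recovery trusts only the map named by the state file.
        return dict(self.record_maps[self.state_file])
```

A crash before the pointer update leaves recovery reading the old stable map, so the partially updated map is never consulted.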


Atomicity. The atomicity of transactions in Valor fol- 
lows from two important constraints which Valor en- 
sures that the OS obeys: (1) that writes to the log par- 
tition and data partitions obey simple write ordering and 
(2) that the LDST is atomic. At mount time, Valor runs 
recovery (Section 3.1.4) to ensure that the log is ready 
and fully describes the on-disk system state when it is 
finished mounting. Thereafter, all proper transactional 
writes are preceded by Log Append calls. No writes go 
to disk until pdflush is called or Valor’s Transaction 
Sync is called. Simple write ordering ensures that in 
both cases, the log records are written before the in- 
place updates, so no update can reach the disk unless its 
corresponding log record has already been written. Log 
records themselves are written atomically and safely be- 
cause writes to the log’s backing store are only made 
during an LDST. Since an LDST is atomic, the state of 
the entire system advances forward atomically as well. 


3.1.4 Performing Recovery 


System Crash Recovery. During the mount opera- 
tion, the logging device checks to see if there are any 
outstanding log records and, if so, runs recovery. Dur- 
ing umount, the Logging Device flushes all commit- 
ted transactions to disk and aborts all remaining trans- 
actions. Valor can perform recovery easily by reading 
the state file to determine which record map for each log 
is stable, and reconstructing the commit sets from these 
record maps. A log sequence number (LSN) stored in 
the record map allows Valor to read in reverse order the 
events captured within the log and play them forward or 
back based on whether the write needs to be completed 
to satisfy durability or rolled back to satisfy atomicity. 
Recovery finds all record map entries and makes a com- 
mit set for each of them, which is by default marked as 
aborted. While traversing the record map entries, if 
it finds an entry with a magic value (written 
asynchronously during Log Resolve) indicating that 
the transaction was committed, it marks that set com- 
mitted. Finally, all commit sets are deallocated and an 
LDST is performed. The system can then come online. 
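
The recovery pass can be sketched as a planning function over a record map. This is a simplification under stated assumptions: the dictionary layout of an entry, the name plan_recovery, and the COMMIT_MAGIC value are hypothetical, and the function only decides which action applies to each page without touching any data.

```python
COMMIT_MAGIC = "COMMIT"  # hypothetical marker value


def plan_recovery(record_map):
    """Decide, per logged page, whether to redo or undo the write.

    record_map is a list of dicts with "txn_id" and "lsn"; page records
    also carry "page", and commit records may carry "magic".
    """
    committed = {}
    for entry in record_map:
        committed.setdefault(entry["txn_id"], False)  # aborted by default
        if entry.get("magic") == COMMIT_MAGIC:
            committed[entry["txn_id"]] = True
    plan = []
    # Walk the log in reverse LSN order, as the text describes.
    for entry in sorted(record_map, key=lambda e: e["lsn"], reverse=True):
        if "page" not in entry:
            continue  # commit records carry no page data
        action = "redo" if committed[entry["txn_id"]] else "undo"
        plan.append((entry["lsn"], action, entry["page"]))
    return plan
```

A transaction whose commit record lacks the magic value never leaves the default aborted state, so all of its pages are rolled back.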


Process Crash Recovery. Recovery handles the case 
of a system crash, something handled by all journaling 
file systems. However, Valor also supports user-process 
transactions and, by extension, user-process recovery. 
When a process calls the do_exit process clean-up rou- 
tine in the kernel, its task_struct is checked to see 
if a transaction was in-flight. If so, then Valor moves the 
commit set for the transaction onto the landed list and 
marks the commit set as aborted. 


3.2 Ensuring Isolation 


Extended mandatory locking is a derivation of manda- 
tory locking, a system already present in Linux and So- 
laris [10, 18]. Mandatory locks are shared for reads but 
exclusive for writes and must be acquired by all pro- 
cesses that read from or write to a file. Valor adds 
these additional features: (1) a locking permission bit 
for owner, group, and all (LPerm), (2) a lock policy sys- 
tem call for specifying how locks are distributed upon 
exit, and (3) the ability to lock a directory (and the re- 
quirement to acquire this lock for directory operations). 
System calls performed by non-transactional processes 
that write to a file, inode, or directory object acquire 
the appropriate lock before performing the operation and 
then release the lock upon returning from the call. Non- 
transactional system calls are consequently two-phase 


with respect to exclusive locks and well-formed with re- 
spect to writes. Thus Valor provides degree 1 isolation. 
In this environment, then, by the degrees of isolation the- 
orem [7], transactional processes that obey higher de- 
grees of isolation can have transactions with repeatable 
reads (degree 3) and no lost updates (degree 2). 

Valor supports inter-process transactions by imple- 
menting inter-process locking. Processes may specify 
(1) if their locks can be recursively acquired by their 
children, and (2) if a child’s locks are released or instead 
given to its parent when the child exits. These specifica- 
tions are propagated to the Extended Mandatory Lock- 
ing system with the Lock Policy system call. 

Valor prevents misuse of locks by allowing a pro- 
cess to acquire a lock only under one of two circum- 
stances: (1) if the process has permission to acquire a 
lock on the file according to the LPerm of the file, or 
(2) if the process has read access or write access, de- 
pending on the type of the lock. Only the owner of a 
file can change the LPerm, but changes to the LPerm 
take effect regardless of transactions’ isolation seman- 
tics. Deadlock is prevented using a deadlock-detection 
algorithm. If a lock would create a circular dependency, 
then an error is returned. Transaction-aware processes 
can then recover gracefully. Transaction-oblivious pro- 
cesses should check the status of the failed system call 
and return an error so that they can be aborted. This 
works well in practice. We have successfully booted, 
used, and shut down a previous version of the Valor sys- 
tem with extended mandatory locking and the standard 
legacy programs. A related issue is the locking of fre- 
quently accessed file-system objects or pages. The de- 
fault Valor behavior is to provide degree 1 isolation, 
which prevents one transaction from accessing the 
page while another transaction is writing to it. For 
transaction-oblivious processes, because each individ- 
ual system call is treated as a transaction, these locks 
are short lived. For transaction-aware processes, an ap- 
propriate level of isolation can be chosen (e.g., degree 
2—no lost updates) to maximize concurrency and still 
provide the required isolation properties. 
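
The text does not say which deadlock-detection algorithm Valor uses. One common approach, shown here purely as an assumption, is to keep a wait-for graph and refuse any lock request that would close a cycle; the function names and the graph representation are illustrative.

```python
def would_deadlock(waits_for, requester, holder):
    """Return True if making requester wait on holder closes a cycle.

    waits_for maps each transaction to the set of transactions it waits on.
    """
    stack, seen = [holder], set()
    while stack:
        node = stack.pop()
        if node == requester:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(waits_for.get(node, ()))
    return False


def request_lock(waits_for, requester, holder):
    """Record the wait, or return False so the caller sees an error."""
    if would_deadlock(waits_for, requester, holder):
        return False
    waits_for.setdefault(requester, set()).add(holder)
    return True
```

A False return corresponds to the error described above: a transaction-aware caller can recover, while a transaction-oblivious one is expected to fail the system call.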


3.3 Application Interception 


Valor supports applications that are aware of trans- 
actions but need to invoke subprocesses that are not 
transaction-aware within a transaction. Such a subpro- 
cess is wrapped in a transaction that begins when it 
first performs a file operation and ends when it exits. 
This is useful for a transactional process that forks sub- 
processes (e.g., grep) to do work within a transaction. 
During system calls, Valor checks a flag in the process 
to determine whether to behave transactionally or not. 
In particular, when a process is forked, it can specify if 
its child is transaction-oblivious. If so, the child has its 
Transaction ID set to that of the parent and its in-flight 
state set to Oblivious. When the process performs any 
system call that constitutes a read or a write on a file, 
inode, or directory object, the in-flight state is checked, 
and an appropriate Log Append call is made with the 
Transaction ID of the process. 
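
The interception path can be pictured as a per-process flag consulted on each file system call. The sketch below is a user-space illustration only: the Process class, the intercepted_write function, and the list standing in for the log are hypothetical, and the append to that list stands in for the Log Append call the kernel would make.

```python
class Process:
    """Hypothetical stand-in for the per-process state Valor consults."""

    def __init__(self, txn_id=None, inflight_state=None):
        self.txn_id = txn_id
        self.inflight_state = inflight_state

    def fork(self, child_is_oblivious=False):
        if child_is_oblivious:
            # The child inherits the parent's transaction ID.
            return Process(self.txn_id, "Oblivious")
        return Process()


def intercepted_write(proc, path, data, log):
    """Model of a write call that logs on behalf of an oblivious child."""
    if proc.inflight_state == "Oblivious":
        log.append((proc.txn_id, path))  # stands in for Log Append
    return len(data)
```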


4 Evaluation 


Valor provides atomicity, isolation, and durability, but 
these properties come at a cost: writes between the log 
device and other disks must be ordered, transactional 
writes incur additional reads and writes, and in-memory 
data structures must be maintained. Additionally, Valor 
is designed to provide these features while only requir- 
ing minor changes to a standard kernel’s design. In 
this section we evaluate the performance of our Valor 
design and also compare it to Stasis and BDB. Sec- 
tion 4.1 describes our experimental setup. Section 4.2 
analyzes a benchmark based on an idealized ARIES 
transaction logger to derive a lower bound on overhead. 
Section 4.3 evaluates Valor’s performance for a serial 
file overwrite. Section 4.4 evaluates Valor’s transaction 
throughput. Section 4.5 analyzes Valor’s concurrent per- 
formance. Finally, Section 4.6 measures Valor’s recovery 
time. All benchmarks test scalability. 


4.1 Experimental Setup 


We used four identical machines, each with a 2.8GHz 
Xeon CPU and 1GB of RAM for benchmarking. Each 
machine was equipped with six Maxtor DiamondMax 10 
7,200 RPM 250GB SATA disks and ran CentOS 5.2 with 
the latest updates as of September 6, 2008. To ensure a 
cold cache and an equivalent block layout on disk, we 
ran each iteration of the relevant benchmark on a newly 
formatted file system with as few services running as 
possible. We ran all tests at least five times and com- 
puted 95% confidence intervals for the mean elapsed, 
system, user, and wait times using the Student’s-t dis- 
tribution. In each case, unless otherwise noted, the half 
widths of the intervals were less than 5% of the mean. 
Wait time is elapsed time less system and user time and 
mostly measures time performing I/O, though it can also 
be affected by process scheduling. We benchmarked 
Valor on the modified Valor kernel, and all other systems 
on a stable unmodified 2.6.25 Linux kernel. 
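
The half-width criterion can be checked with a few lines of arithmetic. The sketch below computes a 95% confidence interval for a mean using tabulated two-sided Student's-t critical values; the function names are illustrative, and any timings passed to it in an example are made up rather than taken from our measurements.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_and_half_width(samples):
    """Return the sample mean and the 95% confidence half-width."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    return mean, T_95[n - 1] * sem


def within_tolerance(samples, fraction=0.05):
    """True if the half-width is below the given fraction of the mean."""
    mean, half = mean_and_half_width(samples)
    return half < fraction * mean
```

With five runs there are four degrees of freedom, so the critical value is 2.776; a run set whose half-width exceeds 5% of the mean would be one of the cases noted explicitly in the text.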


Comparison to Berkeley DB and Stasis. The most 
similar technologies to Valor are Stasis and Berkeley 
DB (BDB): two user-level logging libraries that provide 
transactional semantics on top of a page store for trans- 
actions with atomicity and isolation and with or with- 
out durability. Valor, Stasis, and BDB were all config- 
ured to store their logs on a separate disk from their 
data, a standard configuration for systems with more 




than one disk [7]. The logs used by Valor, Stasis, and 
BDB were set to 128MiB. Since Valor prioritizes non- 
durable transactions, we configured Stasis and BDB to 
also use non-durable transactions. This configuration 
required modifying the source code of Stasis to open 
its log without O_SYNC mode. Similarly, we configured 
BDB’s environment with DB_TXN_WRITE_NOSYNC. The 
ext3 file system performs writes asynchronously by de- 
fault. For file-system workloads it is important to be 
able to perform efficient asynchronous serial writes, so 
non-durable transactions performing asynchronous se- 
rial writes were the focus during our benchmarking. 
BDB indexed each page in the file by its page offset 
and file ID (an identifier similar to an inode number). 
We used the B-Tree access method as this is the suitable 
choice for a large file system [44]. 


4.2 Mock ARIES Lower Bound 


Figure 4 compares Valor’s performance against a mock 
ARIES transaction system to see how close Valor comes 
to the ideal performance for its chosen logging sys- 
tem. We configured a separate logging block device with 
ext2, in order to avoid overhead from unnecessary jour- 
naling in the file system. We configured the data block 
device with ext3, since journaled file systems are in 
common use for file storage. We benchmarked a 2GiB 
file overwrite under three mock systems. MT-ow-noread 
performed the overwrite by writing zeros to the ext2 
device to simulate logging, and then writing zeros to the 
ext3 device to simulate write back of dirty pages. MT- 
ow differs from MT-ow-noread in that it copies a pre- 
existing 2GiB data file to the log to simulate time spent 
reading in the before image. MT-ow-finite differs from 
the other mock systems in that it uses a 128MiB log, 
forcing it to break its operation into a series of 128MiB 
copies into the log file and writes to the data file. A trans- 
action manager based on the ARIES design must do at 
least as much I/O as MT-ow-finite. Valor’s overhead on 
top of MT-ow-finite is 35%. Stasis’s is 104%. The cost 
of MT-ow reading the before images as measured by the 
overhead of MT-ow on MT-ow-noread is only 2%. The 
cost of MT-ow-finite restricting itself to a finite log is 
16% due to required additional seeking. Stasis’s over- 
head is more than Valor’s overhead due to maintaining a 
redundant page cache in user space. 


4.3 Serial Overwrite 


In this benchmark we measure the time it takes for a 
process to transactionally overwrite an existing file. File 
transfers are an important workload for a file system. 
See Figure 5. Providing transactional protection to large 
file overwrites demonstrates Valor’s ability to scale with 
larger workloads and handle typical file system operations. 
Since there is data on the disk already, all systems 
but ext3 must log the contents of the page before 
overwriting it. The transactional systems use a transaction 
size of 16 pages. The primary observation from 
these results is that each system scales linearly with respect 
to the amount of data it must write. Valor runs 
2.75 times longer than ext3, spending the majority of 
that overhead writing log records to the log device. 
Stasis runs 1.75 times slower than Valor. It spends additional 
time allocating pages in user space for its own 
page cache, and doing additional memory copies for its 
writes to both its log and its store file. For the 512 MiB 
overwrite of Valor and Stasis, and the 256 MiB overwrite 
of Stasis, the half-widths were 11%, 7%, and 23%, 
respectively. The asynchronous nature of the benchmark 
caused the page caches used by Valor and Stasis to introduce 
fluctuations in an otherwise stable serial write. BDB’s on-disk 
B-Tree format, which is very different from Stasis’s and 
Valor’s simple page-based layout, makes it difficult to 
perform well in this I/O-intensive workload that has little 
need for log(n) B-Tree lookups. Because of this, Valor 
runs at 8.22 times the speed of BDB. 


Figure 5: Async serial overwrite of files of varying sizes 


Figure 6: Run times for transactions, increasing granularity 


4.4 Transaction Granularity 


We measured the rate for processing small durable trans- 
actions with varying transaction sizes. This benchmark 
establishes Valor’s ability to handle many small transac- 
tions, and indicates the overhead of beginning and com- 
mitting a transaction on a write. We measured BDB, 
Valor, Stasis, and ext3. For ext3 we simply used the 
native page size on the disk. See Figure 6. The through- 
put benchmark is simply the overwrite benchmark from 
Section 4.3, but we vary the size of the transaction rather 
than the amount of data to write. We see the typical 
result that the non-transactional system (ext3) is un- 
affected: transactional systems converge on a constant 
factor of the non-transactional system’s performance as 
the overhead of beginning and committing a transaction 
approaches zero. BDB converges on a factor of 23 of 
ext3’s elapsed time, Stasis converges on a factor of 4.2, 
and Valor converges on a factor of 2.9. It is interest- 
ing that Valor has an overhead of 76% with respect to 
Stasis, and Stasis has an overhead of 25% with respect 
to BDB for single page transactions. BDB is oriented 
toward small transactions making updates to a B-Tree, 
not serial I/O. As the granularity decreases, Stasis and 
BDB converge to less efficient constant factors of the 
non-transactional ext3’s performance than Valor does. 
This suggests that Valor’s overhead for Log Append is 
lower than Stasis’s, since Valor operates from within the 
kernel and eliminates the need for a redundant page 
cache. BDB has already converged to a constant factor of 
ext3’s performance at 1-page transactions; for transactions 
smaller than one page, BDB began to perform worse. 


4.5 Concurrent Writers 


Concurrency is an important measure of how well a file 
interface handles seeking and reduced available memory 
while writing. One application of Valor would be to grant 
atomicity to package managers, which may unpack large 
packages in parallel. To measure concurrency, we ran varying 
numbers of processes, each serially overwriting its own 
independent file. Each process wrote 1GiB 
of data to its own file, and we ran the benchmark with 2, 
4, 8, and 16 processes running concurrently. Figure 7 
illustrates the results of our benchmark. For low numbers 
of processes (2, 4, and 8) BDB had half-widths of 35%, 
6%, and 5% because of the high variance introduced by 
BDB’s user space page cache. Stasis and BDB run at 
2.7 and 7.5 times the elapsed time of ext3. For the 2, 
4, 8, and 16 process cases, Valor’s elapsed time is 3.0, 
2.6, 2.4, and 2.3 times that of ext3. What is notable is 
that these times converge on lower factors of ext3 for 
high numbers of concurrent writers. The transactional 
systems must perform a serial write to a log followed by 
a random seek and a write for each process. BDB and 
Stasis must maintain their page caches, and BDB must 
maintain B-Tree structures on disk and in memory. For 
small numbers of processes, the additional I/O of writing 
to Valor’s log widens the gap between transactional 
systems and ext3. As the number of processes, and 
therefore the number of files being written at once, 
increases, the rate of seeks overtakes both the cost of an 
extra serial log write for each data write and the cost of 
maintaining on-disk or in-memory structures for BDB and Stasis. 


[Figure 7 caption fragment] accessing different files 


4.6 Recovery 


One of the main goals of a journaling file system is to 
avoid a lengthy fsck on boot [11]. Therefore it is im- 
portant to ensure Valor can recover from a crash in a 
finite amount of time with respect to the disk. Valor’s 
ability to recover a file after a crash is based on its log- 
ging an equivalent amount of data during operation. The 
amount of total data that Valor must recover cannot ex- 
ceed the length of Valor’s log, which was 128 MiB in all 
our benchmarks. Valor’s recovery process consists of: 
(1) reading a page from the log, (2) reading the original 

























[Figure 8 plot omitted; legend: Wait, User, System]

Figure 8: Time spent recovering from a crash for varying 
amounts of uncommitted data and varying number of processes 


page on disk, (3) determining whether to roll forward 
or back, and (4) writing to the original page if neces- 
sary. To see how long Valor took to recover for a typical 
amount of uncommitted data, we tested the recovery of 
8MiB, 16MiB and 32MiB of uncommitted data. In the 
first trial, two processes were appending to separate files 
when they crashed, and their writes had to be rolled back 
by recovery. In the second trial, three processes were 
appending to separate files. Process crash was simulated
by simply calling exit(2) without committing the
transaction. Valor first reads the Record Map to reconstitute
the in-memory state at the time of the crash, then plays
each record forward or back in reverse Log Sequence
Number (LSN) order. Figure 8 illustrates our recovery
results. The label 2/8-rec in the figure shows the elapsed
time taken to recover 8MiB of data when 2 processes crash.
Although the amount of time spent recovering is
proportional to the amount of uncommitted data in both
the 2- and 3-process cases, recovering 3 processes takes
more time than recovering 2 because of additional seeking
back and forth between the on-disk pages associated with
log records for 3 uncommitted transactions instead of 2.
2/32-rec is 2.31 times slower than 2/16-rec, and 2/16-rec
is 1.46 times slower than 2/8-rec, due to the varying size
of recoverable data. Similarly, 3/32-rec is 2.04 times
slower than 3/16-rec, and 3/16-rec is 1.5 times slower
than 3/8-rec. Keeping the amount of recoverable data the
same, 3 processes have 44%, 63%, and 60% overhead
compared to 2 processes with recoverable data of 8MiB,
16MiB, and 32MiB, respectively. In the
worst case, Valor recovery can become a random read of 
128MiB of log data, followed by another random read of 
128MiB of on-disk data, and finally 128MiB of random 
writes to roll back on-disk data. 
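
The four recovery steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Valor's kernel code: the `LogRecord` fields, the in-memory `disk` dictionary, and the function names are hypothetical, and only the roll-back case measured in this experiment is modeled (roll-forward of committed records is skipped).

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    lsn: int          # Log Sequence Number (hypothetical field name)
    page_no: int      # on-disk page the record refers to
    before: bytes     # page image before the logged write (undo image)
    committed: bool   # did the owning transaction commit?


def roll_back(log, disk):
    """Undo uncommitted writes; `disk` maps page number -> page bytes.

    Returns the number of pages written during recovery.
    """
    pages_written = 0
    # (1) read records from the log, newest first (reverse LSN order)
    for rec in sorted(log, key=lambda r: r.lsn, reverse=True):
        current = disk[rec.page_no]       # (2) read the original page
        if rec.committed:                 # (3) decide: nothing to undo here
            continue
        if current != rec.before:         # (4) write only if necessary
            disk[rec.page_no] = rec.before
            pages_written += 1
    return pages_written


def worst_case_io_mib(log_mib=128):
    """Log read + on-disk read + roll-back writes, each as large as the log."""
    return 3 * log_mib
```

With the 128 MiB log used in the benchmarks, `worst_case_io_mib()` gives 384 MiB of random I/O in total, which is the bound described above.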


Valor does no logging for read-only transactions (e.g.,
getdents, read) because they do not modify the file
system. Valor only acquires a read lock on the pages
being read, and, because it calls directly down into the file
system to service the read request, there is no overhead.


Systems which use an additional layer of software to 
translate file system operations into database operations 
and back again introduce additional overhead. This is 
why Valor achieves good performance with respect to 
other database-based user level file system implemen- 
tations that provide transactional semantics. These al- 
ternative APIs can perform well in practice, but only if 
applications use their interface, and constrain their work- 
loads to reads and writes that perform well in a standard 
database rather than a file system. Our system does not 
have these restrictions. 


5 Conclusions 


Applications can benefit greatly from having a POSIX- 
compliant transactional API that minimizes the number 
of modifications needed to applications. Such appli- 
cations can become smaller, faster, more reliable, and 
more secure—as we have demonstrated in this and prior 
work. However, adding transaction support to existing
OSs is hard to achieve simply and efficiently, as we
discovered in several prototypes of our own.

This paper has several contributions. First, we de- 
scribe two older prototypes and designs for file-based 
transactions: (1) KBDBFS which attempted to port a 
standalone BDB library and add file system support 
into the Linux kernel—adding over 150,000 complex 
lines-of-code to the kernel, duplicating much effort; (2) 
Amino, which moved all that functionality to user level, 
making it simpler, but incurring high overheads. 

The second and primary contribution of this paper is 
our design of Valor, which was informed by our previous 
attempts. Valor runs in the kernel cooperating with the 
kernel’s page cache, and runs more efficiently: Valor’s 
performance comes close to the theoretical lower bound 
for a log-based transaction manager, and scales much 
better than Amino, BDB, and Stasis.

Unlike KBDBFS, however, Valor integrates seam- 
lessly with the Linux kernel, by utilizing its existing fa- 
cilities. Valor required less than 100 LoC changes to 
pdflush and another 300 LoC to simply wrap system 
calls; the rest of Valor is a standalone kernel module 
which adds less than 4,000 LoC to the stackable file sys- 
tem template Valor was based on. 


Future Work. One of our eventual goals is to explore 
the use of Log Structured Merge Trees [25] to optimize 
our general purpose log and provide faster name lookups 
(e.g. decreasing the elapsed time of find). 

Another interesting research direction is to use 
NFSv4’s compound calls to implement network-based 
file transactions [38]. This may require semantic changes
to NFSv4 so as to not allow partial success of some
operations within a compound, and to allow the NFS server
to perform atomic updates to its back-end storage.

Finally we intend to further investigate the ramifica- 
tions of weakening fsync semantics in light of current 
trends in hard drive write cache design. We want to ex- 
plore the possibility of extending asynchronous barrier 
writes based on native command queueing to the user 
level layer so that systems which use atomicity mecha- 
nisms across multiple devices (e.g., via a logical volume 
manager or multiple mounts) can retain atomicity. We 
believe we could avoid hard drive cache flushes [23] us- 
ing tagged I/O support for SATA drives and export this 
write ordering primitive to layers higher than the block 
device and file system implementation. We also are in- 
terested in analyzing the probability of failure when us- 
ing varying semantics for fsync as well as analyzing 
the associated performance trade-offs. 
Abstract


Customer problem troubleshooting has been a crit- 
ically important issue for both customers and system 
providers. This paper makes two major contributions to 
better understand this topic. 

First, it provides one of the first characteristic stud- 
ies of customer problem troubleshooting using a large 
set (636,108) of real world customer cases reported from 
100,000 commercially deployed storage systems in the 
last two years. We study the characteristics of cus- 
tomer problem troubleshooting from various dimensions 
as well as correlation among them. Our results show that 
while some failures are either benign or resolved
automatically, many others can take hours or days of
manual diagnosis to fix. For modern storage systems,
hardware failures and misconfigurations dominate customer
cases, but software failures take longer to resolve.
Interestingly, a relatively significant percentage of cases 
are because customers lack sufficient knowledge about 
the system. We observe that customer problems with at- 
tached system logs are invariably resolved much faster 
than those without logs. 

Second, we evaluate the potential of using storage 
system logs to resolve these problems. Our analysis 
shows that a failure message alone is a poor indicator 
of root cause, and that combining failure messages with 
multiple log events can improve low-level root cause pre- 
diction by a factor of three. We then discuss the chal- 
lenges in log analysis and possible solutions. 


1 Introduction 
1.1 Motivation 


There has been a lot of effort, both academic and com- 
mercial [12, 22, 29, 35, 36, 46], put into building robust 
systems over the past two decades. Despite this, prob- 
lems always occur at customer sites. Customers usually 
report such problems to system vendors who are then re- 
sponsible for diagnosing and fixing the problems. Rapid 
resolution of customer problems is critical for two
reasons. First, failures in the field result in costly downtime
for customers. Second, these problems can be very
expensive for system vendors in terms of customer support
personnel costs.


A recent study indicates that problem diagnosis re- 
lated activity is 36-43% of TCO (total cost of owner- 
ship) in terms of support costs [17]. Additionally, down- 
time can cost a customer 18-35% of TCO [17]. The 
system vendor pays a price as well. A survey showed 
that vendors devote more than 8% of total revenue and 
15% of total employee costs on technical support for cus- 
tomers [52]. The ideal is to automate problem resolution, 
which can occur in seconds and essentially costs $0. 


Unfortunately, customer problem troubleshooting is 
very challenging because modern computing environ- 
ments consist of multiple pieces of hardware and soft- 
ware that are connected in complex ways. For exam- 
ple, a customer running an application, which uses a 
database on a storage system, might complain about poor 
performance, but without sophisticated diagnostic infor- 
mation, it is often difficult to tell if the root cause is 
due to the application, network switches, database, or 
storage system. Individual components such as stor- 
age systems are themselves composed of many intercon- 
nected modules, each of which has its own failure modes. 
For example, a storage system failure can be caused by
disks, physical interconnects, shelves, RAID controllers,
etc. [4, 5, 27, 47, 28]. Furthermore, a large fraction of
customer problems tend to be human-generated
misconfigurations [46] or operator mistakes [43].


In all these cases, there is a problem symptom (e.g. 
system failure) and a problem root cause (e.g. disk shelf 
failure). The goal of customer problem troubleshooting 
is to rapidly identify the root cause from the problem 
symptom, and apply the appropriate fix such as a software
patch, hardware replacement, or configuration correction.
In some cases the fix is simply to correct a customer’s
mistaken expectation.

It is standard practice for software and hardware
providers today to build in the capability to record
important system events in logs [51, 44]. Despite the
widespread existence of logs, there is limited research
on the use of logs to troubleshoot system misbehavior
or failures. For IP network systems, some fault localiza- 
tion studies use log events to observe the network link 
failures, while the core diagnosis algorithms rely on the 
dependency models describing the relationship between 
link failures and network component faults [32, 30, 3]. 
For other systems, such a priori knowledge is usually 
lacking. Other research using logs deals with intrusion 
detection and security auditing [1, 19]. In industry,
LogLogic [39] and Splunk [25] provide solutions to help mine
logs for patterns or specific words. While useful, they do 
not automate system fault diagnosis. 

In this paper, we explore the use of storage sys- 
tem logs to troubleshoot customer problems. We start 
by characterizing the nature of customer problems, and 
measuring problem resolution time with and without logs 
(Sections 2 and 3). We then evaluate the extent to which 
a problem symptom alone can help narrow the possi- 
ble cause of the problem (Section 4). Finally, we study 
the challenges in using logs to accurately obtain prob- 
lem root cause information (Section 5) and briefly outline 
some ideas we have for automated log analysis (Section 
5.3). We are currently evaluating these ideas in a sys- 
tem we are building for fully automated customer trou- 
bleshooting from logs. 


1.2 Our Findings 


Providing meaningful, quantitative answers to the 
questions we want to explore is a challenging task since it 
requires analysis of hundreds or thousands of real world 
customer cases and system logs. We speculate the lack 
of availability of such a data set is one of the reasons for 
the absence of studies in this area. 

We had access to three structured databases at NetApp 
containing a wealth of information about customer cases, 
relevant system logs, and engineering analysis of the cus- 
tomer problems. 

Using this data, our work makes two major contri- 
butions. First, it provides one of the first characteris- 
tic studies of customer problem troubleshooting using 
a large set (636,108) of real world cases from 100,000 
commercially deployed storage systems produced by Ne- 
tApp. We study the characteristics of customer problem 
troubleshooting from various dimensions including dis- 
tribution of root causes, impact, problem resolution time 
as well as correlation among them. We evaluate the fea- 
sibility and challenges of using logs to resolve customer 
problems and outline a potential automatic log analysis 
technique. 

We have the following major findings: 


(1) Problem troubleshooting is a time-consuming and 
challenging task. While we observed that 36% of re- 
ported problems are benign and automatically resolved, 
Sat Apr 15 05:58:15 (illegible) encountered an unexpected bus phase. Issuing SCSI bus reset.   <-- Problem root cause
Sat Apr 15 05:59:10 EST [fs.warn]: volume /vol/vol1 is low on free space. 98% in use.   <-- Log noise
Sat Apr 15 06:01:10 EST [fs.warn]: volume /vol/vol10 is low on free space. 99% in use.   <-- Log noise
Sat Apr 15 06:02:14 EST [raidDiskRecovering]: Attempting to bring device 9a back into service.   <-- RAID retry
Sat Apr 15 06:02:14 EST [raidDiskRecovering]: Attempting to bring device 9b back into service.   <-- RAID retry
Sat Apr 15 06:07:19 EST [timeoutError]: device 9a did not respond to requested I/O. I/O will be retried.
Sat Apr 15 06:07:19 EST [noPathsError]: No more paths to device 9a. All retries have failed.
Sat Apr 15 06:07:19 EST [timeoutError]: device 9b did not respond to requested I/O. I/O will be retried.
Sat Apr 15 06:07:19 EST [noPathsError]: No more paths to device 9b. All retries have failed.
Sat Apr 15 06:08:23 EST [filerUp]: Filer is up and running.
Sat Apr 15 06:24:07 EST [panicALERT]: Panic String: File system hung in process idle_thread1   <-- Critical event (problem symptom)

[Side label spanning the log: Total of 106 log events]


Figure 1. A sample asup message log. The 
problem symptom is a panic. The root cause is a SCSI 
bus bridge error. For this root cause, the log has some 
noise, i.e. events that are not connected with this case. 


the remainder required expensive manual intervention 
that can take a long time. 


(2) Hardware failures (40%) and misconfigurations 
(21%) dominate customer cases. Software bugs account 
for a small fraction (3%) but can cause significant down- 
time and take much longer to resolve. 


(3) A significant percentage of customer problems 
(11%) are because customers lack sufficient knowledge 
about the system, which leads to misconfiguring the op- 
erating environment. 


(4) More than 87% of problems have low impact be- 
cause they are handled by built-in failure tolerance mech- 
anisms such as RAID-DP® [16]. While high-impact 
problems are much fewer, they are much more difficult to 
troubleshoot due to complex interactions between system 
modules and the multiple failure modes of these mod- 
ules. 


(5) An important finding is that customer cases with 
available system log messages invariably have a shorter 
(16-88%) problem resolution time than cases that don’t 
have logs. 


(6) Critical events in logs, which capture the failure 
symptoms, can help identify the high-level problem cat- 
egory, such as hardware problem, misconfiguration prob- 
lem, etc. However, on their own, critical events are not 
enough to identify a more precise problem root cause 
which is necessary to resolve the customer problem. 


(7) Combining critical events with multiple other log 
events can improve the problem root cause prediction by 
3x, except for misconfigurations which tend to have too 
many noisy, unrelated log events. 


(8) Logs are challenging to analyze manually. They 
contain a lot of log noise, due to messages logged by 
modules that are not related to the problem. Often log 
messages are fuzzy as well. This calls for an intelligent 
log analysis tool to filter out log noise and accurately cap- 
ture a problem signature. 


While we believe that many of our findings can be 
generalized to other system providers, especially storage 
system providers, we would still like to caution readers 
to take our dataset and evaluation methodology into
consideration when interpreting and using our results.


2 Data Sources and Methodology 


In this section, we describe how customer cases are 
created and resolved, and the use of system logs in this 
process. We also discuss how we select case and log data 
for analysis. 


2.1 The AutoSupport System 


The AutoSupport system [33] consists of infrastruc- 
ture built into the operating system to log system events 
and to collect and forward these events to a database. 
While customers can choose if they want to forward 
these messages to the storage company, in practice most 
do so since it allows for proactive system monitoring and 
faster customer support. 

Asup messages (autosupport messages) are sent both
on a periodic basis and when critical events occur.
Periodic messages contain aggregated information
for the week such as average CPU utilization, number
of read and write I/Os, ambient temperature, disk space
utilization, etc. Critical events consist of warning
messages or failure messages. Warnings, such as a volume
being low on space, can be used for proactive resolution.
A failure message, such as a system panic or disk failure,
is diagnosed and fixed after it is reported.

Every asup message contains a unique id that identifies
the system that generated the message, the reason for
the message, and any additional data that can help, such
as previously logged system events, system configuration,
etc.


2.2 An Example Scenario and Terminology 


Figure 1 shows a sample asup message log. At the 
very end of the log is a critical event which is a message 
showing there was a file system panic that halted the sys- 
tem. Critical events can be either failure messages or 
warning messages. The critical event contains a problem 
symptom, in this case the system panic, which is what 
the customer observes as the problem. 

Notice that every module in the system logs its own 
messages, and this is part of what makes log analysis 
very difficult. There is often a lot of log noise, which 
is what we call log messages that are not relevant to the 
current problem. As we see in Figure 1, there are over 
100 messages in a short span of time, most of which are 
not relevant to the problem symptom. 
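
To make the notion of log noise concrete, the following is a minimal sketch that extracts the bracketed event tag from log lines shaped like those in Figure 1 and drops lines whose tag is considered noise. The helper names and the choice of which tags count as noise are assumptions made for illustration; this is not the analysis tool discussed in this paper.

```python
import re

# Matches the bracketed event tag, e.g. "[fs.warn]:" -> "fs.warn"
TAG = re.compile(r"\[([^\]]+)\]:")


def event_tag(line):
    """Return the event tag of a log line, or None if it has no tag."""
    match = TAG.search(line)
    return match.group(1) if match else None


def drop_noise(lines, noisy_tags):
    """Keep only lines whose tag is not in `noisy_tags` (untagged lines are kept)."""
    return [line for line in lines if event_tag(line) not in noisy_tags]


sample = [
    "Sat Apr 15 05:59:10 EST [fs.warn]: volume /vol/vol1 is low on free space. 98% in use.",
    "Sat Apr 15 06:02:14 EST [raidDiskRecovering]: Attempting to bring device 9a back into service.",
    "Sat Apr 15 06:24:07 EST [panicALERT]: Panic String: File system hung in process idle_thread1",
]
relevant = drop_noise(sample, noisy_tags={"fs.warn"})
```

In practice the difficulty is that the set of noisy tags depends on the problem at hand, which is why a fixed filter like this one is not sufficient on its own.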

In this example, we see that various components below
the file system, including the RAID and the SCSI
layers, log their own failure messages. From our analysis,
we determined that the problem root cause was a SCSI
bus failure, which is logged 106 events before the
problem symptom.


Therefore, manually inspecting these logs can be time
consuming. Furthermore, manual inspection requires a
good understanding of the interactions between various
software layers. In this example, the person resolving the
case from logs would need to realize that the SCSI bus
failure made disks unavailable, which in turn caused the
file system to panic to prevent further writes that could
not be safely written to disk.


[Figure 2 diagram labels: Human-Generated, Auto-Generated, Support Center, Support Staff, Field Problems (64%), Warnings (35%), Automatic Hardware Replacement (1%), Automatic System Panic Diagnosis (~0.1%), Resolutions]


Figure 2. Flowchart of the customer sup- 
port system 


2.3 How Customer Cases are Created 


Customer cases are created either automatically or 
manually. For every asup message that is received by 
the company, a rule-engine is applied to determine if a 
customer case should be created in the customer sup- 
port database. We refer to these cases as auto-generated 
cases. Such cases have a problem symptom, which is the 
asup failure or warning message that led to the case be- 
ing opened. For example, a system panic is a symptom 
that always results in the creation of a customer case. 

Human-generated cases are those that are created di- 
rectly by the customer, either over the phone or by email. 
These often include performance problems which are 
difficult to detect and log automatically. 

Figure 2 illustrates how customer cases are generated 
and resolved in the customer support system. 


2.4 Customer Case Resolution 


Auto-generated customer cases are either manually
resolved or automatically resolved. As Figure 2 shows, 35%
of customer cases are filtered out by the system since
they are warnings that have no immediate customer
impact. For 1% of customer cases, for example a disk
failure, the resolution is to automatically ship a replacement
part. About 0.1% of customer cases are system panics that
were automatically resolved by comparing the panic message
and stack backtrace to a knowledge base and pointing the
customer to the appropriate fix.
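
The routing just described can be illustrated with a small sketch. Everything here is hypothetical: the event kinds, the `Route` names, and the knowledge-base contents are placeholders chosen for the example, and the real rule engine is not described at this level of detail in the paper.

```python
from enum import Enum


class Route(Enum):
    FILTERED_WARNING = "warning, no immediate customer impact"
    AUTO_PART_SHIPMENT = "automatic hardware replacement"
    AUTO_PANIC_DIAGNOSIS = "automatic system panic diagnosis"
    SUPPORT_STAFF = "manual resolution by support staff"


def route_case(kind, signature=None, knowledge_base=None):
    """Pick a resolution path for an auto-generated case.

    `kind` is a hypothetical event kind ("warning", "disk_failure", "panic", ...).
    `signature` is a (panic message, stack backtrace) pair for panics.
    `knowledge_base` maps known signatures to a fix description.
    """
    knowledge_base = knowledge_base or {}
    if kind == "warning":
        return Route.FILTERED_WARNING
    if kind == "disk_failure":
        return Route.AUTO_PART_SHIPMENT
    if kind == "panic" and signature in knowledge_base:
        return Route.AUTO_PANIC_DIAGNOSIS
    # Anything else needs a human to diagnose it.
    return Route.SUPPORT_STAFF


# Placeholder knowledge base with a made-up signature and fix.
kb = {("example panic message", "example backtrace"): "example fix"}
```

Only cases that fall through to `Route.SUPPORT_STAFF`, together with human-generated cases, correspond to the manually resolved cases that the rest of the study focuses on.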

In our study, we focus only on human-generated and
auto-generated cases that are manually resolved since
these are the ones that are most expensive both in terms
of downtime and financial cost to the customer and the
storage system company.


2.5 Data Selection 


We now describe how we selected customer case data 
for analysis in later sections of this paper. There are 
two primary databases that were used. The first is a 
Customer Support Database that contains details on ev- 
ery customer case that was human-generated or auto- 
generated. Certain problems that cannot be resolved 
by customer support staff are escalated to engineering 
teams, who also record such problems in an Engineering 
Case Database. 

We analyzed 636,108 customer cases from the Customer
Support Database over the period 1/1/2006 to
1/1/2008. Of these, 329,484 customer cases were
human-generated and 306,624 customer cases were
auto-generated. Overall, these represent about 100,000 storage
systems.

For each of these 636,108 customer cases, problem 
category and resolution time are retrieved from the Cus- 
tomer Support Database. For each of the 306,624 auto- 
generated customer cases, we also retrieved the critical 
event that led to the creation of the case. However, the 
human-generated cases do not have such information. 

The goal for resolving any customer case is to determine
the problem root cause as soon as possible. Since
such information in the Customer Support Database is
unstructured, it was difficult to identify the problem root
cause for solved cases. However, the Engineering Case
Database records problem root cause at a fine level. We
used 4,769 such cases that were present in both the
Customer Support and Engineering Case databases to
analyze problem root cause and its correlation with
critical events.

To study the correlation between problem root cause 
and storage system logs, we retrieve the AutoSupport 
logs from the AutoSupport Database. Since not all cus- 
tomer systems send AutoSupport logs to the company, 
among 4,769 customer cases, 4,535 customer cases have 
corresponding AutoSupport log information. 
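
As a quick consistency check on the counts quoted in this section, the arithmetic can be reproduced directly. The variable names are ours; the numbers are those stated above.

```python
human_generated = 329_484
auto_generated = 306_624
total_cases = human_generated + auto_generated      # 636,108 cases in total

engineering_cases = 4_769    # cases present in both databases
cases_with_logs = 4_535      # of those, cases with AutoSupport logs

# Percentages rounded to one decimal place
human_share_pct = round(100 * human_generated / total_cases, 1)
log_coverage_pct = round(100 * cases_with_logs / engineering_cases, 1)
```

The two case types sum to the stated 636,108 cases, roughly 51.8% of which are human-generated, and about 95.1% of the 4,769 engineering cases have corresponding logs.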


2.6 Generality of our study 


Although our study is based on customer service 
workflow at NetApp, we believe it is quite representative. 
As defined in ITIL [57], this customer service workflow 
represents a typical troubleshooting sequence: a problem 
case is opened by a call to the help center or by an alert 
generated by a monitoring system, followed by diagno- 
sis by support staff. A similar process is followed by 
IBM customer service as described in [24]. Moreover,
the comprehensive environment of the storage systems
gives us an opportunity to study a mixture of hardware,
software, and configuration problems.
Figure 3. Cumulative Distribution Function
(CDF) of resolution time for all customer
cases.¹ There is wide variance in problem resolution
time, with some cases taking days to solve.


3 Characteristics of Field Problems 
3.1 Problem Resolution Time 


One of the most important metrics of customer support
is problem resolution time, which is the time elapsed
between when a case is opened and when a resolution or
workaround is available to the customer. The distribution
of problem resolution times is the key to understanding
the complexity of a specific problem or problem class,
since it mostly reflects the amount of time spent on
troubleshooting problems. Note that it should not be
used directly to calculate MTTR (Mean Time To Recovery),
since it does not capture the amount of time needed to
completely solve the problems (e.g., for hardware-related
problems, it does not include hardware replacement or
any delay in scheduling it to minimize the impact
on users).

Figure 3 shows the Cumulative Distribution Function 
(CDF) of resolution time for all customer cases selected 
from the Customer Support Database. It is possible for 
troubleshooting to take many hours. For a small fraction 
of cases, resolution time can be even longer. Since the 
x-axis of the figure is logarithmic, the graph shows that 
doubling the amount of time spent on problem resolution 
does not double the number of cases resolved. While the 
AutoSupport logging system is an important step in helping
troubleshoot problems, this figure makes the case that
better tools and techniques are needed to reduce problem 
resolution time. 
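
For readers less familiar with CDFs, the following sketch shows how one point of such a curve is computed. The resolution times are invented for illustration (the paper's actual values are anonymized), and the function name is ours.

```python
def fraction_resolved_within(times, limit):
    """Empirical CDF: the fraction of cases whose resolution time is <= limit."""
    return sum(1 for t in times if t <= limit) / len(times)


# Hypothetical resolution times in arbitrary units; NOT data from the study.
sample_times = [1, 2, 2, 4, 8, 8, 16, 64, 128, 512]

within_4 = fraction_resolved_within(sample_times, 4)    # 4 of 10 cases
within_8 = fraction_resolved_within(sample_times, 8)    # 6 of 10 cases
within_16 = fraction_resolved_within(sample_times, 16)  # 7 of 10 cases
```

In this made-up sample, doubling the time budget from 4 to 8 units raises the resolved fraction from 0.4 to 0.6 rather than doubling it, which is the kind of diminishing return a long-tailed distribution produces.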


3.2 Problem Root Cause Categories 


Analyzing the distribution of problem root causes is 
useful in understanding where one should spend effort 
when troubleshooting customer cases or designing more 
robust systems. While a problem root cause is precise, 
such as a SCSI bus failure, in this section we lump root 
causes into categories such as hardware, software, mis- 
configuration, etc. For all the customer cases, we study 


¹ We anonymize results to preserve confidentiality and anonymity.
(a) Categorization of Problem Root Causes
(b) Average Resolution Time per Problem Root Cause Category¹


Figure 4. Problem Root Cause Category. Hardware Failure is related to problems with hardware components,
such as disk drives. Software Bug is related to storage system software, and Misconfiguration is related to system problems
caused by errors in configuration. User Knowledge is related to technical questions, e.g., explaining why customers were
seeing certain system behaviors. Customer Environment is related to problems not caused by the storage system itself. The
figure shows that hardware failures and misconfiguration problems are the major root causes, but software bugs took longer
to resolve.
Figure 5. Resolution Time Spent on Problem
Root Cause by Category. Although software
problems take longer to resolve on average,
hardware failure and misconfiguration related problems
have greater impact on customer experience.


resolution time for each category, relative frequency of 
cases in each category, and the cost which is the average 
resolution time multiplied by the number of cases for that 
category. 


As Figure 4 (a) shows, hardware failures and miscon- 
figuration are the two most frequent problem root cause 
categories, and contribute 40% and 21% to all customer 
cases, respectively. Software bugs account for a small 
fraction (3%) of cases. We speculate that software bugs 
are not that common since software undergoes rigorous 
tests before being shipped to customers. Besides tests, 
there are many techniques [12, 29, 35, 36] that can be
applied to find bugs in software. While on average, based
on Figure 4 (b), software bugs take longer to resolve,
their number is so small that their overall impact
on total time spent on all problem resolutions is not very
high, as Figure 5 shows.
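
The cost metric, average resolution time multiplied by the number of cases, can be illustrated with a short sketch. The case shares (40%, 21%, 3%) come from the text; the relative average resolution times are hypothetical placeholders, since the actual values are anonymized.

```python
# Cases per 100 customer cases, taken from the shares reported in the text.
cases_per_100 = {
    "hardware failure": 40,
    "misconfiguration": 21,
    "software bug": 3,
}

# Relative average resolution time per case: HYPOTHETICAL values chosen
# only so that software bugs are the slowest category to resolve.
relative_avg_time = {
    "hardware failure": 1.0,
    "misconfiguration": 1.2,
    "software bug": 3.0,
}

# cost = average resolution time x number of cases
cost = {name: cases_per_100[name] * relative_avg_time[name] for name in cases_per_100}

slowest_category = max(relative_avg_time, key=relative_avg_time.get)
costliest_category = max(cost, key=cost.get)
```

Even with software bugs taking three times as long per case in this made-up example, their total cost stays well below that of hardware failures because there are so few of them.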




It is interesting to observe that a relatively significant 
percentage of customer problems are because customers 
lack sufficient knowledge about the system (11%) or cus- 
tomers’ own execution environments are incorrect (9%) 
(e.g. a backup failure caused by a Domain Name Sys- 
tem error). These problems can potentially be reduced by 
providing more system training programs or better con- 
figuration checkers. 

Figure 4 (b) is our first indication that logs are indeed
useful in reducing problem resolution time. Auto-generated
customer cases, i.e., those with an attached system
log and a problem symptom in the form of a critical
event message, take less time to resolve than
human-generated cases. The latter are often poorly defined over
the phone or by email. The only instance where this is
not true is when the problem relates to the customer’s
environment, which is difficult to record via an automated
system.


3.3 Problem Impact


In the previous subsections, we have treated all problems
as equal in their impact on customers. We now consider
customer impact for each problem category. To do this, we
divide customer cases into 6 categories based on impact,
ranging from system crash, which is the most serious, to
low-impact unhealthy status. The other categories, from
higher to lower impact, are usability (e.g., inability to
access a volume), performance, hardware component failure,
and unhealthy status (e.g., instability of the
interconnects, low spare disk count). Hardware failures
typically have low impact since the storage systems are
designed to tolerate multiple disk failures [16],
power-supply failures, filer head failures, etc. However,
until the failed component is replaced, the system operates
in degraded mode, where the potential for complete system
failure exists should its redundant component fail.

Figure 6. Problem Impact. From left to right, the categories
are ordered from higher to lower impact on customer
experience. Although problems with higher impact happen much
less frequently than problems with lower impact, they are
usually more complicated to resolve.

Since human-generated customer cases do not have all impact
information in a structured format, we randomly sampled 200
human-generated cases and manually analyzed them. For
auto-generated problems, we include all the cases and
leverage the information in the Customer Support Database.

For both human-generated and auto-generated cases, the
classification is exclusive: each problem case is classified
into one and only one category. The classification is based
on how a problem impacts the customer’s experience. For
example, a disk failure that led to a system panic is
classified as an instance of System Crash. If it did not
lead to a system crash (i.e., RAID handled it), it is
classified as an instance of Hardware Component Failure.
Note that in our study, Performance problems are problem
cases that lead to unexpected performance slowdown.
Therefore, disk failures leading to the expected slowdown of
RAID reconstruction are classified as Hardware Component
Failures instead of Performance problems.

Figure 6 (a) shows the distribution of problems by 
impact. One obvious observation is that there are far 
fewer high-impact problems than low-impact ones. More 
specifically, system crash only contributes about 3%, and 
usability problems contribute about 10%. Low impact 
problems such as hardware component failure and un- 
healthy status contribute about 44% and 20%, respec- 
tively. 

While high-impact problems are much fewer, as Figure 6 (b)
shows, they are more time consuming to troubleshoot. This is
due to the complex interaction between system modules. For
example, the problem shown in Figure 1 resulted in a system
crash. The root cause was an error in the SCSI bus bridge.
This started a chain of recovery mechanisms in layers of
software, including retries by the RAID layer and SCSI
layer. As a result, the time from the root cause to system
failure is about half an hour, and there are more than 100
log events in between the critical event and problem root
cause. This makes manual diagnosis of such problems
difficult, even when logs are available.

Figure 7. Case Generation Method and Resolution Time.
Auto-generated problems are resolved faster than
human-generated problems.

Finally, as we observed in the previous section, auto- 
generated cases take less time to resolve than human- 
generated ones. 


3.4 Customer case generation method 


As we mentioned in Section 2, 51.6% of customer cases
were human-generated and 48.4% were auto-generated.
We now look at how these two methods impact resolution
time.

Figure 7 shows that resolution time for auto-generated 
and human-generated customer cases is similar in dis- 
tribution: both show huge variance in time. On the 
other hand, auto-generated cases were solved faster than 
human-generated ones. 

One possible reason why auto-generated cases can be resolved
faster than human-generated ones is that auto-generated
cases contain valuable information such as critical events,
which capture problem symptoms. In addition, information on
prior failures or warnings is available in the system’s
logs.

² “System Crash” here means the crash of a single system,
which might not lead to service downtime in a cluster
configuration.

Figure 8. Critical Events can partially help infer
high-level problem root causes. The distribution of customer
cases across problem root cause categories, for 20 of the
most common critical events. 4,769 auto-generated customer
cases that contain detailed root cause diagnosis were
selected from the Engineering Case Database for this
analysis.

In comparison, human-generated problems are usually sent
with vague descriptions, which vary from one person to
another and lack the rigorous structure of auto-generated
ones.

Similar trends have been observed in Figure 4(b). 
Across all problem root cause categories, auto-generated 
cases take 16-88% less resolution time than human- 
generated cases. The only exception is Customer 
Environment cases, where auto-generated and human- 
generated cases take similar average resolution time. 

4 Can Critical Events Help Infer Root 
Causes? 

Having established that customer cases with attached 
system logs result in improved problem resolution time, 
we now ask if critical events in the logs can be directly 
used to identify problem root cause. To remind the 
reader, a critical event is a special kind of log message 
that contains a problem symptom, and triggers the au- 
tomatic opening of a customer case via the Autosupport 
system. An example of such an event is a system panic 
log message. 


4.1 High-level Problem Root Causes 


We first look at the relationship between critical 
events and high-level problem root cause categories: 
hardware failure, software bug, and misconfiguration. 
We do not present the results for the other two problem 
root cause categories (user knowledge and customer en- 
vironment) because they are often human-generated and 
rarely have a clear critical event in the system log. 

Figure 8 shows the distribution of customer cases amongst
the three high-level root cause categories for the 20 most
frequent critical events. For this experiment, we selected
those auto-generated customer cases from the Customer
Support Database that were also in the Engineering Case
Database, so that we could relate each customer case to its
detailed engineering diagnosis.

Case A

Sun Aug 5 08:26:39 CDT [downloadRequest]: newer system software download requested.
Sun Aug 5 08:29:38 CDT [downloadRequestDone]: download complete.
Sun Aug 5 08:34:36 CDT [raidLabelUpgrade]: upgrade RAID labels.
Sun Aug 5 08:34:56 CDT [diskLabelBroken]: device 1 has a broken label.
Sun Aug 5 08:34:56 CDT [diskLabelBroken]: device 2 has a broken label.
Sun Aug 5 08:37:42 CDT [raidVolumeFailure: ALERT]: RAID volume 1 has failed.

Case B

Wed Jan 14 09:41:13 CET [raidDiskInsert]: device 7 inserted.
Wed Jan 14 09:42:57 CET [raidMissingChild]: RAID object 0 only has 1 child, expecting 18.
Wed Jan 14 09:44:05 CET [raidVolumeFailure: ALERT]: RAID volume 2 has failed.

Figure 9. Two real-world customer cases with the same
critical event, RAID Volume Failure, but different root
causes. Case A was caused by a software bug: large-capacity
disks, which were previously used in degraded mode (not at
full capacity), were used at full capacity after a software
upgrade. However, due to a software bug, disk labels could
not be correctly recognized, and multiple broken labels led
to a RAID Volume Failure Message. Case B was caused by
misconfiguration: the customer mistakenly inserted a
non-zeroed disk into the system, leading to a RAID Volume
Failure Message.


As seen in Figure 8, for several critical events, there is a
dominant high-level problem root cause. For example, 91% of
customer cases with critical event 10 (a Misconfiguration
Warning Message) were diagnosed as misconfiguration
problems, and 95% of customer cases with critical event 11
(a Hardware Failure Warning Message) were diagnosed as
hardware failure problems. This is not surprising, since
these critical event messages have clear semantic meaning.

Figure 10. Critical Events cannot infer module-level problem
root causes.

Case C

Tue Feb 21 19:00:01 EST [FibreChannelUnstable]: indicates loop stability problem.
Tue Feb 21 19:27:25 EST [timeoutError]: device 4a did not respond to requested I/O. I/O will be retried.
Tue Feb 21 19:27:35 EST [timeoutError]: device 4a did not respond to requested I/O. I/O will be retried.
Tue Feb 21 19:28:46 EST [noPathsError]: No more paths to device 4a. All retries have failed.
Tue Feb 21 19:29:03 EST [diskFailure: ALERT]: device 4a has failed.

Case D

Fri May 19 18:38:29 CEST [ioReassignFail]: device 5a sector 140392917 reassign failed.
Fri May 19 18:38:34 CEST [ioReassignFail]: device 5a sector 140392918 reassign failed.
Fri May 19 18:38:40 CEST [ioReassignFail]: device 5a sector 140392919 reassign failed.
Fri May 19 18:39:17 CEST [thresholdMediumError]: device 5a has crossed the medium error threshold.
Fri May 19 18:39:53 CEST [diskFailure: ALERT]: device 5a has failed.

Figure 11. Two real-world customer cases with the Disk
Failure Message. Customer case C was caused by Fibre Channel
loop instability and customer case D was caused by disk
medium errors.

However, some critical events cannot be easily mapped to one
dominant high-level problem root cause. One example is
critical event 07 (a RAID Volume Failure Message). Among
customer cases with critical event 07, 51% were diagnosed as
hardware failure related, 16% as caused by misconfiguration,
and 33% as caused by software bugs.

To better understand why there is not always a one-to-one
mapping between critical event and root cause category,
Figure 9 shows two real-world auto-generated customer cases
that were both triggered by the same critical event: RAID
Volume Failure. As illustrated by the figure, customer case
A was caused by a software bug, while customer case B was
caused by a misconfiguration (details are explained in the
caption).

For a small majority of common critical events (13 out of
20), there is a dominant (> 65%) high-level problem root
cause. Therefore, we conclude that critical events can
partially help infer the high-level problem root causes.
However, the high-level root cause is not enough to resolve
the customer’s problems. One needs to determine the precise
root cause. In the next section, we see if critical events
at least help us narrow down the root cause to specific
storage system modules.
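
The dominance test described above can be sketched in a few lines of Python. This is an illustrative sketch, not the analysis code used in the study: the function name `dominant_root_cause` and the input layout are assumptions, and the 5% remainder for event 11 is made up. Only the 65% threshold and the quoted percentages for events 07 and 11 come from the text.

```python
from collections import Counter

DOMINANCE_THRESHOLD = 0.65  # the "> 65%" threshold quoted in the text


def dominant_root_cause(diagnoses, threshold=DOMINANCE_THRESHOLD):
    """Return (category, share) if one category exceeds the threshold, else None."""
    counts = Counter(diagnoses)
    total = sum(counts.values())
    if total == 0:
        return None
    category, count = counts.most_common(1)[0]
    share = count / total
    return (category, share) if share > threshold else None


# Critical event 07: 51% hardware, 16% misconfiguration, 33% software bug.
event_07 = ["hardware"] * 51 + ["misconfiguration"] * 16 + ["software bug"] * 33
# Critical event 11: 95% hardware; the split of the remaining 5% is assumed.
event_11 = ["hardware"] * 95 + ["software bug"] * 5

print(dominant_root_cause(event_07))
print(dominant_root_cause(event_11))
```

Under these inputs, event 07 has no dominant category while event 11 does, which mirrors the distinction drawn in the text.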


4.2 Module-level Problem Root Causes 


A module-level problem root cause defines which module or
component³ caused the problem experienced by the customer.
Zooming into one particular module is a significant step
towards problem resolution. With such knowledge, customer
cases can be effectively assigned to the experts who are
familiar with that module.

Figure 10 presents the distribution of module-level 
problem root causes among the customer cases with the 
same critical event. The same data set was used as for 
Figure 8. The selected customer cases were diagnosed 
with 13 different module-level root causes. The figure 
shows that for only 4 out of 20 messages, there is a dom- 
inant (> 65%) module-level problem root cause. There- 
fore critical events are not indicative of module-level 
problem root causes. 

One explanation is that modules in the storage stack 
have complex interactions. Multiple code paths can lead 
to the same failure symptom. An example is critical 
event 03 (Disk Failure Message), which is quite indica- 
tive (> 75%) of a hardware failure; however, an error 
in multiple hardware modules can lead to this message. 
Figure 11 illustrates two real-world customer cases trig- 
gered by Disk Failure Messages. As the figure explains, 
customer case C was actually due to Fibre Channel Loop 
instability while customer case D was caused by multiple 
disk medium errors on the same disk. 

Since APIs between modules enforce clean separation 
between caller and callee, modules tend to log “local” 
state information, i.e., what happens within the module.
Theoretically a more sophisticated logging infrastructure 
could store the interactions between modules and gener- 
ate the critical events that capture “global” system state. 


³ We will use module to represent both software module and
hardware component in the rest of the paper.

Figure 12. Comparison between three methods of using log
events. F-score indicates how accurately the module-level
problem root cause can be predicted using log information.
The same set of customer cases is used here as for Figure 8,
excluding customer cases without AutoSupport logs in the
AutoSupport Database, which leaves 4,535 customer cases.


However, we believe it is impractical to build such log- 
ging infrastructure for existing commercial products, due 
to the complexity of module interaction. Furthermore, 
such infrastructure would be very hard to maintain as the 
system evolves and more modules are added. We believe 
the solution is to combine the critical log event with other 
log information and in the next section we study the fea- 
sibility of doing so. 


5 Feasibility of Using Logs for Automating 
Troubleshooting 


As we analyzed in the previous section, critical events 
alone are not enough for identifying the problem root 
cause beyond a high level. This conclusion is supported 
by several real-world customer cases presented in Fig- 
ure 9 and Figure 11. These customer cases also suggest 
that log events in addition to the critical events can be 
quite useful for identifying the problem root causes. 

In this section, we investigate the feasibility of using 
additional information from system logs and answer the 
following two questions: Does problem root cause de- 
termination improve by considering log events beyond 
critical events? What kind of log events are key to iden- 
tifying the problem root cause? 


5.1 Are additional log events useful?


To study whether additional log events are useful, we
consider three methods of using log event information, and
compare how well they can be used as a module-level problem
root cause signature. We define a signature as a set of
relevant log events that uniquely identify a problem root
cause. Such a signature can be used to identify recurring
problems and to distinguish one problem from another
unrelated one, thereby helping with customer
troubleshooting. It is important to note that we are not
designing algorithms to find log signatures; instead, we are
manually computing log signatures to study how they improve
problem root cause determination.

As a baseline, our first method is to only use the prob- 
lem’s critical event as its signature. For each module- 
level problem root cause, using a set of manually di- 
agnosed cases as training data, we search for one criti- 
cal event that can best differentiate customer cases diag- 
nosed with this root cause from other customer cases. 
More specifically, for each module-level problem root 
cause, we exhaustively search through all critical events, 
and calculate their F-score, which measures how well 
the critical event can be used to predict the problem root 
cause [49]. Then we pick the critical event with the high- 
est F-score as the signature for this module-level prob- 
lem root cause. 

The second method is similar to method one. But in- 
stead of just looking at critical events to deduce a root 
cause signature, we search all log events looking for the 
one log message that best indicates the module-level root
cause. If this method can find log signatures with much 
better F-score, it indicates that some log events other 
than critical events provide more valuable information 
for identifying problem root cause. 
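
The first two methods can be illustrated with a short sketch. It assumes the balanced F-score (the harmonic mean of precision and recall), which is the common reading of the term; the exact variant used in the study is defined in [49]. The function names, the case representation, and the sample cases (loosely modeled on Figure 11) are hypothetical.

```python
def f_score(cases, event, root_cause):
    """Balanced F-score of predicting `root_cause` whenever `event` is present."""
    tp = sum(1 for events, cause in cases if event in events and cause == root_cause)
    fp = sum(1 for events, cause in cases if event in events and cause != root_cause)
    fn = sum(1 for events, cause in cases if event not in events and cause == root_cause)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_single_event(cases, root_cause, candidates):
    """Exhaustively search `candidates` for the event with the highest F-score."""
    return max(candidates, key=lambda e: f_score(cases, e, root_cause))


# Hypothetical training cases: (events seen in the log, diagnosed module).
training = [
    ({"diskFailure", "timeoutError"}, "fc_loop"),
    ({"diskFailure", "timeoutError", "noPathsError"}, "fc_loop"),
    ({"diskFailure", "ioReassignFail"}, "disk"),
    ({"diskFailure", "ioReassignFail"}, "disk"),
]
critical_events = ["diskFailure"]
all_events = sorted({e for events, _ in training for e in events})

# Method 1: search critical events only. Method 2: search all log events.
print(best_single_event(training, "fc_loop", critical_events))
print(best_single_event(training, "fc_loop", all_events))
```

In this toy data the critical event appears in every case and so separates the two root causes poorly, while a non-critical event identifies the Fibre Channel cases exactly.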

The third method is to use a decision tree [9] to find 
the best mapping between multiple log events and the 
problem root cause. The resulting multiple log events 
can be used as the root cause signature. 

For all three methods, we use the same set of cus- 
tomer cases as in Figure 8, except removing customer 
cases without AutoSupport logs. This gives us 4,535 cus- 
tomer cases. A random selection of 60% of these cases is 
used as training data, while the remaining 40% are used 
as testing data. 
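
To make the third method and the 60/40 split concrete, the following is a minimal sketch of a decision tree over log event presence. The text cites [9] but does not say which tree-building algorithm was used, so the entropy-based (ID3-style) splitting here is an assumption, as are all function names and the sample cases.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def split_entropy(cases, event):
    """Weighted entropy after splitting cases on the presence of `event`."""
    present = [cause for events, cause in cases if event in events]
    absent = [cause for events, cause in cases if event not in events]
    if not present or not absent:
        return float("inf")  # the event does not separate anything
    return (len(present) * entropy(present) + len(absent) * entropy(absent)) / len(cases)


def build_tree(cases, events):
    """Return a root cause label (leaf) or a dict that tests one log event."""
    labels = [cause for _, cause in cases]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not events:
        return majority
    best = min(events, key=lambda e: split_entropy(cases, e))
    if split_entropy(cases, best) == float("inf"):
        return majority
    rest = [e for e in events if e != best]
    return {
        "event": best,
        "present": build_tree([c for c in cases if best in c[0]], rest),
        "absent": build_tree([c for c in cases if best not in c[0]], rest),
    }


def predict(tree, events):
    """Walk the tree until a leaf (a root cause label) is reached."""
    while isinstance(tree, dict):
        tree = tree["present"] if tree["event"] in events else tree["absent"]
    return tree


# Hypothetical diagnosed cases, loosely modeled on cases C and D in Figure 11.
cases = (
    [({"diskFailure", "timeoutError", "noPathsError"}, "fc_loop")] * 3
    + [({"diskFailure", "timeoutError"}, "fc_loop")] * 2
    + [({"diskFailure", "ioReassignFail", "thresholdMediumError"}, "disk")] * 5
)
random.Random(0).shuffle(cases)
split = len(cases) * 6 // 10  # 60% training, 40% testing
train, test = cases[:split], cases[split:]
all_events = sorted({e for events, _ in cases for e in events})

tree = build_tree(train, all_events)
accuracy = sum(predict(tree, ev) == cause for ev, cause in test) / len(test)
print(tree, accuracy)
```

The events tested along a path from the root to a leaf play the role of a multi-event signature; on real logs the tree would be deeper than in this toy example.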

As Figure 12 shows, for all customer cases, using only
critical events as the problem signature is a very poor
predictor of root cause. On average, it only achieves an
F-score of about 0.15. Using the best matched log event,
instead of just critical events, can achieve an F-score of
0.27. By comparison, the average F-score achieved by the
decision tree method for computing problem signatures is
0.45, which is 3x better than using critical events. Based
on these results, we conclude that accurate problem root
cause determination requires combining multiple log events
rather than a single log event or critical event. This
observation matters, since customer support personnel
usually focus on the critical event, which can be
misleading. Furthermore, as we show in the next section,
there is often a lot of noise between key log events, making
it hard to manually detect problem signatures.

Table 1. Characteristics of Log Signatures. We manually
studied 35 customer cases. These 35 customer cases can be
grouped into 10 groups, where each group had the same
problem root cause. Based on diagnosis notes from engineers,
we were able to identify the key log events, which can
differentiate cases in one group from cases in another.
“# of Key Log Events” is the total number of important log
events (including critical events) needed to identify the
problem. “Distance” is calculated as the longest distance
from a key log event to a critical event for each customer
case, averaged across all cases.

Although we use the decision tree to construct log sig- 
natures that are composed of multiple log events, we do 
not advocate this technique as the solution for utilizing 
log information. First of all, the accuracy (F-score) is still
not satisfactory due to log noise, which we discuss later. 
Moreover, the effectiveness of the decision tree relies on 
training data. For problem root causes that do not have a 
large number of diagnosed instances, a decision tree will 
not provide much help. 


5.2 Challenges of using log information 


To understand the challenges of using log information 
and identifying key log events to compute a problem sig- 
nature, we manually analyzed 35 customer cases sam- 
pled from the Engineering Case Database. These cus- 
tomer cases were categorized into 10 groups, such that 
cases in each group had the same problem root cause. 

For these customer cases, we noticed that engineers 
used several key log events to diagnose the root cause. 
Table 1 summarizes these cases and characteristics of 
their key log events. 

Based on these 10 groups, we made the following major
observations:


(1) Logs are noisy. 
Figure 13. Cumulative Distribution Function (CDF) of number
of log events within one hour of critical event. For this
figure, we use the same data set as Figure 12. We only count
the log events generated and recorded by the AutoSupport
system within one hour before the critical event, since in
practice engineers often only examine recent log events for
problem diagnosis.


Figure 13 shows the Cumulative Distribution Function (CDF)
of the number of log events in AutoSupport logs
corresponding to customer cases. As can be seen in the
figure, for the majority of the customer cases (about 75%),
there are more than 100 log events recorded within an hour
before the critical event occurred, and for the top 20% of
customer cases, more than 1000 log events were recorded.


In comparison, as Table 1 shows, there are usually
only 2-4 key log events for a given problem, implying
that most log events are just noise for the problem.


(2) Important log events are not easy to locate. 


Table 1 shows the distance between key log events and
critical events, both in terms of time and the number of
log events. For 6 out of 10 problems, at least one key log
event is more than 30 log events away from the critical
event, which captures the failure point. For all problems,
there are always some irrelevant log events in between
the key log events and the critical event. In terms of time,
the key log events can be minutes or even hours before
the critical event.


(3) The pattern of key log events can be fuzzy.


Sometimes, it is not necessary to have an exact set of key
log events for identifying a particular problem. Using
problem 7 as an example, the “raidDiskInsert” log event may
or may not appear, depending on how the system administrator
added the disk drive. Another example is problem 2. The same
shelf intraconnect error can be detected by different
modules, and different log messages can be seen for it
depending on which module reports the issue.
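
The two measurements behind observations (1) and (2), the number of log events in the hour before the critical event and the distance from a key log event to the critical event, can be sketched as follows. The log is case C from Figure 11, which is far shorter than a typical AutoSupport log; the function names, the placeholder year, and the choice of key log events are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def events_in_window(log, critical_index, window=timedelta(hours=1)):
    """Count log events recorded within `window` before the critical event."""
    critical_time = log[critical_index][0]
    return sum(1 for ts, _ in log[:critical_index] if critical_time - ts <= window)


def longest_distance(log, critical_index, key_events):
    """Largest number of log positions between a key log event and the critical event."""
    distances = [
        critical_index - i
        for i, (_, name) in enumerate(log[:critical_index])
        if name in key_events
    ]
    return max(distances, default=0)


def at(hour, minute, second):
    # The year is a placeholder; Figure 11 does not give one.
    return datetime(2006, 2, 21, hour, minute, second)


case_c = [
    (at(19, 0, 1), "FibreChannelUnstable"),
    (at(19, 27, 25), "timeoutError"),
    (at(19, 27, 35), "timeoutError"),
    (at(19, 28, 46), "noPathsError"),
    (at(19, 29, 3), "diskFailure"),  # critical event
]
critical = len(case_c) - 1
assumed_key_events = {"FibreChannelUnstable", "noPathsError"}  # assumption

print(events_in_window(case_c, critical))
print(longest_distance(case_c, critical, assumed_key_events))
```

Averaging `longest_distance` over all cases in a group would give a value comparable to the “Distance” column described in the caption of Table 1.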


5.3 Preliminary Prototype for Automatic Log 
Analysis 


Based on the above observations, we designed and implemented
a log analysis prototype to improve the customer
troubleshooting process. It is important to note that we are
still exploring the design space and evaluating the
effectiveness of our log analyzer on real-world customer
cases.

Our analyzer contains two major functions: extracting log
signatures and grouping similar log sequences. As discussed
in observation (1), system logs are very “noisy”, containing
many log events irrelevant to the problem. We also observed
(Table 1) that 2-4 key log events are sufficient to serve as
a problem signature.

In order to extract log signatures, our log analyzer
automatically ranks log events based on their “importance”.
As mentioned in observation (2), important log events are
difficult to locate and can be far away from critical events
(failure points). To address this challenge, we apply
statistical techniques to infer the dependency between the
system states represented by log events. Then we design a
heuristic algorithm to estimate the “importance” of a log
event based on the following two rules:

(1) Between two dependent log events, the temporally 
precedent event is usually more important than its suc- 
cessor. If two log events are dependent, the earlier one 
usually captures the system state that is closer to the be- 
ginning of the error propagation process. 

(2) The larger dependence “fan-out” a log event has, 
the more important it is. Our reasoning is that if a log 
event has a dependence relationship with many other log 
events and it precedes other log events, it signifies a crit- 
ical system state. 

In this manner, we compute “important” log events 
for a given problem and rank the top four events which 
we then use as the problem signature. Even if the sig- 
nature is not entirely accurate, we believe the process of 
extracting important events and highlighting those can 
greatly reduce the time spent by customer support staff 
in manually analyzing logs. 
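
The two rules can be illustrated with a small sketch. The details of the actual heuristic are not given in this paper, so everything below is an assumption: dependencies are taken as already inferred pairs of event types, each dependent pair credits its earlier event (rule 1), the credit count serves as the fan-out score (rule 2), and ties are broken by time of first occurrence. The dependency pairs are invented for the events of case C in Figure 11.

```python
from collections import defaultdict


def rank_events(first_seen, dependencies, top_k=4):
    """Rank event types by how many dependent events they precede."""
    fan_out = defaultdict(int)
    candidates = set()
    for a, b in dependencies:
        candidates.update((a, b))
        # Rule 1: within a dependent pair, credit the earlier event.
        earlier = a if first_seen[a] <= first_seen[b] else b
        # Rule 2: each credit adds to that event's fan-out.
        fan_out[earlier] += 1
    ranked = sorted(candidates, key=lambda e: (-fan_out[e], first_seen[e]))
    return ranked[:top_k]


# Position of the first occurrence of each event type in the log.
first_seen = {
    "FibreChannelUnstable": 0,
    "timeoutError": 1,
    "heartbeat": 2,  # hypothetical noise event with no inferred dependency
    "noPathsError": 3,
    "diskFailure": 4,
}
# Hypothetical output of the dependency inference step.
dependencies = [
    ("FibreChannelUnstable", "timeoutError"),
    ("FibreChannelUnstable", "noPathsError"),
    ("FibreChannelUnstable", "diskFailure"),
    ("timeoutError", "noPathsError"),
    ("noPathsError", "diskFailure"),
]

print(rank_events(first_seen, dependencies))
```

Events that take part in no inferred dependency, such as the noise event here, never enter the ranking in this sketch; whether the prototype treats them the same way is not stated.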

The second function of our log analyzer is to identify
similar log sequences. As described in observation (3),
similar log sequences that represent the same problem root
cause might not have exactly the same set of key log events.
Therefore, our log grouping engine clusters logs based on
their similarity, by mapping log signatures into a vector
space with each log event as a dimension. We then apply
unsupervised classification techniques to group similar
sequences together based on their relative positions in the
vector space [41].
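
A minimal sketch of this grouping step follows. The text does not name the unsupervised technique, so the cosine similarity measure, the greedy assignment to the first sufficiently similar group, and the 0.5 threshold are stand-ins chosen for brevity. The signatures are built from the event names in Figure 9; the second signature is a hypothetical variant of case B without the “raidDiskInsert” event, as in observation (3).

```python
import math


def to_vector(signature, vocabulary):
    """Map a set of log events to a 0/1 vector with one dimension per event."""
    return [1.0 if event in signature else 0.0 for event in vocabulary]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_signatures(signatures, min_similarity=0.5):
    """Greedily group signatures; the first member of a group represents it."""
    vocabulary = sorted(set().union(*signatures))
    vectors = [to_vector(s, vocabulary) for s in signatures]
    groups = []
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= min_similarity:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


signatures = [
    {"raidDiskInsert", "raidMissingChild", "raidVolumeFailure"},   # case B
    {"raidMissingChild", "raidVolumeFailure"},                     # variant of case B
    {"raidLabelUpgrade", "diskLabelBroken", "raidVolumeFailure"},  # case A
]
print(group_signatures(signatures))
```

The two misconfiguration signatures fall into one group even though their event sets differ, while the software bug signature, which shares only the critical event with them, forms its own group.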

Since we are still exploring the design space and eval- 
uating the effectiveness of our log analysis techniques, 
the details of the log analyzer are beyond the scope of 
this paper and remain as our future work. 


6 Related Work


6.1 Problem Characteristic Studies


There have been many prior studies that categorize
computer system problems and identify root causes, as
we have done.

A number of studies show that operator mistakes are 
one of the major causes of failures. One of the first stud- 
ies of fault analysis on commercial fault-tolerant sys- 
tems [21] analyzes Tandem System outages with more 
than 2000 systems in scope. Gray classifies causes into 
5 major categories and 13 sub-categories, and finds that 
operator error is the largest single cause of failure in 
deployed Tandem systems. Murphy and Gent examine 
causes of system crashes in VAX systems between 1985 
and 1993, and find that system management caused more 
than half of the failures, software about 20%, and hard- 
ware about 10% [42]. Similarly, the characteristic study 
by Oppenheimer et al. classifies Internet service failures 
into component failures and service failures, and further 
analyzes root causes for each failure type for each In- 
ternet service [45]. They also found that operator error 
is the largest cause of failures in two of the three ser- 
vices, and configuration errors are the largest category 
of operator errors. While their work focuses on system 
outages, we are also interested in failures that don’t lead 
to outages. We classify storage system failures based on 
symptoms as well as root causes, and further show the 
correlations between problem root cause, symptom and 
resolution time. 

Ganapathi et al. have developed a categorization framework
for Windows registry related problems [20]. Similar to our
work, their classification is based on problem manifestation
and scope of impact to help understand the problem. Although
they describe some causes of problem manifestations, they do
not have a clear classification for them. Since our goal is
to be able to do problem diagnosis, we study not only the
problem symptoms, but also the root causes of those
symptoms.

Some failure studies have also been conducted on storage
systems. Jiang et al. conduct a characteristic study of
NetApp® storage subsystem failures [27, 28]. They classify
storage subsystem failures into four types, and then study
how storage subsystem components can affect storage
subsystem reliability.


6.2 Troubleshooting Studies 


Since troubleshooting is very time-consuming, quite a few
studies have tried to make it more efficient by automating
the process. By studying characteristics of problem tickets
in an enterprise IT infrastructure, researchers at IBM T.J.
Watson built PDA, a problem diagnosis tool, to help solve
problems more efficiently [24]. Banga attempts to automate
the diagnosis process of appliance field problems that is
usually performed by human experts: system health monitoring
and error detection, component sanity checking, and
configuration change tracking [6]. Redstone et al. propose a
vision of an automated problem diagnosis system that
captures symptoms from users’ desktops and matches them
against a problem database [48].

In order to make this process automated, knowledge 
about detection and checking rules and logic has to be 
predefined by human experts. Cohen et al. present a 
method for extracting signatures from system states to 
help identify recurrent problems and leverage previous 
diagnosis efforts [15]. Alternatively, by comparing the 
target configuration file with the mass of healthy con- 
figuration files [55], Wang et al. identified problematic 
configuration entries that cause Windows® system prob- 
lems. Similarly, Wang [56] and Lao [34] address miscon- 
figuration problems in Windows systems by building and 
identifying signatures of normal and abnormal Windows 
Registry entries. Some studies apply advanced
techniques such as data mining to troubleshooting. For
example, PinPoint [14, 13] traces and collects requests, 
and performs data clustering analysis on them to deter- 
mine the combinations of components that are likely to 
be the cause of failures. 

It is important to collect system traces for
troubleshooting, as the AutoSupport logging system does.
Magpie [7], Flight Data Recorder [54], and the work by Yuan
et al. [58] improve system management by using fine- 
grained system event-tracing mechanisms and analysis. 
Stack back traces are used by several diagnostic systems, 
including Dr. Watson [18], Gnome’s bug-buddy [11], and 
IBM diagnosis tool [40]. 


6.3 Log Analysis 


There are two major directions taken by previous re- 
searchers to analyze system logs: tupling and depen- 
dency extraction. 

As a system failure may propagate through multiple system
components, multiple log events indicating failure or
abnormal status of components can be generated during a
short period of time. Based on this observation, several
studies try to reduce the complexity of system logs by
grouping successive log events into tuples [8, 10, 23, 26,
37, 38, 53]. For example, Tsao [53], Hansen [23] and Lin
[38] applied variants of tupling algorithms on system logs
collected from VAX/VMS machines. The tupling algorithms
explore the time-space relationship between log events, and
cluster temporally related events into tuples, so that the
number of logical entities can be significantly reduced. The
limitation of tupling algorithms is that log events in a
tuple may be unrelated if related log events are interleaved
with irrelevant log events. Unfortunately, based on our
study of modern system logs, such a limitation is fatal.
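
The idea behind tupling, and the limitation noted above, can be seen in a minimal sketch. It implements only the simplest time-gap variant (a new tuple starts whenever the gap to the previous event exceeds a fixed threshold); the cited works use more refined variants. The function name, the timestamps in seconds, and the interleaved `cronJobStarted` event are hypothetical.

```python
def tuple_events(events, max_gap):
    """Group time-ordered (timestamp, name) events into tuples by time gap."""
    tuples = []
    for ts, name in events:
        # Extend the current tuple if this event closely follows the previous one.
        if tuples and ts - tuples[-1][-1][0] <= max_gap:
            tuples[-1].append((ts, name))
        else:
            tuples.append([(ts, name)])
    return tuples


# Timestamps are seconds from the start of the log.
events = [
    (0, "timeoutError"),
    (10, "timeoutError"),
    (15, "cronJobStarted"),  # unrelated event interleaved with the failure
    (81, "noPathsError"),
    (98, "diskFailure"),
    (1000, "snapshotCreated"),
]

for group in tuple_events(events, max_gap=100):
    print([name for _, name in group])
```

The unrelated event ends up in the same tuple as the failure events because it is temporally close to them, which is the interleaving problem described above.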

Another direction taken by previous studies is to extract
dependency between log events. Steinle et al. [50] apply two
data mining techniques, aiming at finding the dependency
between two events in a log collected from the Geneva
University Hospitals environment. The first technique
estimates the distribution of temporal distance between two
events, and compares it against a random distribution. The
second technique extracts the correlation between two event
types using association statistics. Aguilera et al. [2]
apply signal processing techniques to extract dependency
between events. The main hypothesis behind this work is that
if two events are correlated, one or a few typical temporal
gaps between these two events can be found through signal
processing. Our study focuses on a characteristic study of
manually identified key log events, and discusses the
challenges and opportunities for applying log analysis.
Several observations made in our study using storage system
logs are consistent with conclusions made in [31]. Both
studies identified that noisy and redundant log information
makes log analysis a challenging task, and that there is
great value in extracting event correlations for capturing
error context and propagation. However, compared to [31],
which made a qualitative study using 2-week distributed
system logs, our study looked at 4,769 storage system log
files with the corresponding real-world problem diagnosis,
carried out a quantitative study on the usefulness of logs,
and proposed an automatic log analysis solution.


7 Conclusion 


In this paper, we present one of the first studies of 
the characteristics of customer problem troubleshooting 
from logs, using a large set of customer support cases 
from NetApp. Our results show that customer problem 
troubleshooting is a very time consuming and challeng- 
ing task, and can benefit from automation to speedup res- 
olution time. We observed that customer problems with 
attached logs were invariably resolved sooner than those 
without logs. We show that while a single log event, 
or critical log event is a poor predictor of problem root 
cause, combining multiple key log events leads to a 3x 
improvement in root cause determination. Our results 
also show that logs are challenging to analyze manually
because they are noisy and that key log events are often 
separated by hundreds of unrelated log messages. We 
then outlined our ideas for an automatic log analysis tool 
that can speed up problem resolution time. 

As with other characteristic studies, it is infeasible
to study even a handful of different data sets, especially for
customer support problems, because such data sets are
unavailable. Even though our data set (which is already
very large, with 636,108 cases from 100,000 systems) is
limited to NetApp, we believe that this study is an
important first step in quantifying both the usefulness of
and challenge in using logs for customer problem trou- 
bleshooting. We hope that our study can inspire and mo- 
tivate characteristic studies about other kinds of systems 
as well, and motivate the creation of new tools for au- 
tomated log analysis for customer problem troubleshoot- 
ing. 
Abstract


We present DIADS, an integrated DIAgnosis tool for
Databases and Storage area networks (SANs). Existing 
diagnosis tools in this domain have a database-only (e.g., 
[11]) or SAN-only (e.g., [28]) focus. DIADS is a first- 
of-a-kind framework based on a careful integration of
information from the database and SAN subsystems; it
is not a simple concatenation of database-only and SAN-
only modules. This approach not only increases the ac- 
curacy of diagnosis, but also leads to significant improve- 
ments in efficiency. 

DIADS uses a novel combination of non-intrusive ma- 
chine learning techniques (e.g., Kernel Density Estima- 
tion) and domain knowledge encoded in a new symptoms 
database design. The machine learning component pro- 
vides core techniques for problem diagnosis from mon- 
itoring data, and domain knowledge acts as checks-and- 
balances to guide the diagnosis in the right direction. 
This unique system design enables DIADS to function 
effectively even in the presence of multiple concurrent 
problems as well as noisy data prevalent in production 
environments. We demonstrate the efficacy of our ap- 
proach through a detailed experimental evaluation of DI- 
ADS implemented on a real data center testbed with Post- 
greSQL databases and an enterprise SAN. 


1 Introduction 


“The online transaction processing database myOLTP 
has a 30% slow down in processing time, compared to 
performance two weeks back.” This is a typical prob- 
lem ticket a database administrator would create for the 
SAN administrator to analyze and fix. Unless there is an 
obvious failure or degradation in the storage hardware 
or the connectivity fabric, the response to this problem 
ticket would be: “The I/O rate for myOLTP tablespace 
volumes has increased 40%, with increased sequential 
reads, but the response time is within normal bounds.” 
This to-and-fro may continue for a few weeks, often driv- 
ing SAN administrators to take drastic steps such as mi- 
grating the database volumes to a new isolated storage 
controller or creating a dedicated SAN silo (the inverse 
of consolidation, explaining in part why large enterprises
still continue to have highly under-utilized storage sys- 
tems). The myOLTP problem may be fixed eventually 
by the database administrator realizing that a change in a 
table’s properties had made the plan with sequential data 
scans inefficient; and the I/O path was never an issue. 

The above example is a realistic scenario from large 
enterprises with separate teams of database and SAN 
administrators, where each team uses tools specific to 
its own subsystem. With the growing popularity of 
Software-as-a-Service, this division is even more pre- 
dominant with application administrators belonging to 
the customer, while the computing infrastructure is pro- 
vided and maintained by the service provider administra- 
tors. The result is a lack of end-to-end correlated infor- 
mation across the system stack that makes problem diag- 
nosis hard. Problem resolution in such cases may require 
either throwing iron at the problem and re-creating re- 
source silos, or employing highly-paid consultants who 
understand both databases and SANs to solve the perfor- 
mance problem tickets. 

The goal of this paper is to develop an integrated di- 
agnosis tool (called DIADS) that spans the database and 
the underlying SAN consisting of end-to-end I/O paths 
with servers, interconnecting network switches and fab- 
ric, and storage controllers. The input to DIADS is a 
problem ticket from the administrator with respect to a 
degradation in database query performance. The out- 
put is a collection of top-K events from the database and 
SAN that are candidate root causes for the performance 
degradation. Internally, DIADS analyzes thousands of 
entries in the performance and event logs of the database 
and individual SAN devices to shortlist an extremely se- 
lective subset for further analysis. 


1.1 Challenges in Integrated Diagnosis 


Figure 1: Example database/SAN deployment.

Figure 1 shows an integrated database and SAN taxonomy
with various logical (e.g., sort and scan operators
in a database query plan) and physical components
(e.g., server, switch, and storage controller). Diagnosis
of problems within the database or SAN subsystem is an
area of ongoing research (described later in Section 2).
Integrated diagnosis across multiple subsystems is even
more challenging:


• High-dimensional search space: Integrated analysis
involves a large number of entities and their combi- 
nations (see Figure 1). Pure machine learning tech- 
niques that aim to find correlations in the raw mon- 
itoring data—which may be effective within a sin- 
gle subsystem with few parameters—can be ineffec- 
tive in the integrated scenario. Additionally, real- 
world monitoring data has inaccuracies (i.e., the data 
is noisy). The typical source of noise is the large 
monitoring interval (5 minutes or higher in produc- 
tion environments) which averages out the instanta- 
neous effects of spikes and other bursty behavior. 

• Event cascading and impact analysis: The cause and
effect of a problem may not be contained within a 
single subsystem (i.e., event flooding may result). 
Analyzing the impact of an event across multiple 
subsystems is a nontrivial problem. 

• Deficiencies of rule-based approaches: Existing diagnosis
tools for some commercial databases [11]
use a rule-based approach where a root-cause taxonomy
is created and then complemented with rules
to map observed symptoms to possible root causes.
While this approach has the merit of encoding valuable
domain knowledge for diagnosis purposes, it
may become complex to maintain and customize.

Figure 2: Taxonomy of scenarios for root-cause analysis.


1.2 Contributions 


The taxonomy of problem determination scenarios han- 
dled by DIADS is shown in Figure 2. The events in 
the SAN subsystem can be broadly classified into con- 
figuration changes (such as allocation of new applica- 
tions, change in interconnectivity, firmware upgrades, 
etc.) and component failure or saturation events. Simi- 
larly, database events could correspond to changes in the 
configuration parameters of the database, or a change in 
the workload characteristics driven by changes in query 
plans, data properties, etc. The figure represents a matrix 
of change events, with relatively complex scenarios aris- 
ing due to combinations of SAN and database events. In 
real-world systems, the no change category is mislead- 
ing, since there will always be change events recorded 
in management logs that may not be relevant or may not 
impact the problem at hand; those events still need to be 
filtered by the problem determination tool. For complete- 
ness, there is another dimension (outside the scope of this 
paper) representing transient effects, e.g., workload contention
causing transient saturation of components.
The key contributions of this paper are: 


• A novel workflow for integrated diagnosis that uses
an end-to-end canonical representation of database 
query operations combined with physical and logical 
entities from the SAN subsystem (referred to as de- 
pendency paths). DIADS generates these paths by an- 
alyzing system configuration data, performance met- 
rics, as well as event data generated by the system or 
by user-defined triggers. 

• The workflow is based on an innovative combination
of machine learning, domain knowledge of configu- 
ration and events, and impact analysis on query per- 
formance. This design enables DIADS to address the 
integrated diagnosis challenges of high-dimensional 
space, event propagation, multiple concurrent prob- 
lems, and noisy data. 

• An empirical evaluation of DIADS on a real-world
testbed with a PostgreSQL database running on an 
enterprise-class storage controller. We describe prob- 
lem injection scenarios including combinations of 
events in the database and SAN layers, along with a 
drill-down into intermediate results given by DIADS. 


2 Related Work 


We give an overview of relevant database (DB), storage, 
and systems diagnosis work, some of which is comple- 
mentary and leveraged by our integrated approach. 


2.1 Independent DB and Storage Diagnosis 


There has been significant prior research in performance 
diagnosis and problem determination in databases [11, 
10, 20] as well as enterprise storage systems [25, 28]. 
Most of these techniques perform diagnosis in an isolated 
manner attempting to identify root cause(s) of a perfor- 
mance problem in individual database or storage silos. In 
contrast, DIADS analyzes and correlates data across the 
database and storage layers. 

DB-only Diagnosis: Oracle’s Automatic Database Diag- 
nostic Monitor (ADDM) [10, 11] performs fine-grained 
monitoring to diagnose database performance problems, 
and to provide tuning recommendations. A similar system
[6] has been proposed for Microsoft SQL Server. (Interested
readers can refer to [33] for a survey on database
problem diagnosis and self-tuning.) However, these tools 
are oblivious to the underlying SAN layer. They cannot 
detect problems in the SAN, or identify storage-level root 
causes that propagate to the database subsystem. 
Storage-only Diagnosis: Similarly, there has been re- 
search in problem determination and diagnosis in en- 
terprise storage systems. Genesis [25] uses machine 
learning to identify abnormalities in SANs. A disk I/O 
throughput model and statistical techniques to diagnose
performance problems in the storage layer are described
in [28]. There has also been work on profiling techniques
for local file systems [3, 36] that help collect data
useful in identifying performance bottlenecks as well as 
in developing models of storage behavior [18, 30, 21]. 
Drawbacks: Independent database and storage analysis 
can help diagnose problems like deadlocks or disk failures.
However, independent analysis may fail to diagnose
problems that do not violate conditions in any one
layer, but rather contribute cumulatively to the overall poor
performance. Two additional drawbacks exist. First, it
can involve multiple sets of experts and be time consum- 
ing. Second, it may lead to spurious corrective actions as 
problems in one layer will often surface in another layer. 
For example, slow I/O due to an incorrect storage vol- 
ume placement may lead a DB administrator to change 
the query plan. Conversely, a poor query plan that causes 
a large number of I/Os may lead the storage administra- 
tor to provision more storage bandwidth. 

Studies measuring the impact of storage systems on 
database behavior [27, 26] indicate a strong interdepen- 
dence between the two subsystems, highlighting the im- 
portance of an integrated diagnosis tool like DIADS. 


2.2 System Diagnosis Techniques 


Diagnosing performance problems has been a popular re- 
search topic in the general systems community in recent 
years [32, 8, 9, 35, 4, 19]. Broadly, this work can be split 
into two categories: (a) systems using machine learn- 
ing techniques, and (b) systems using domain knowl- 
edge. As described later, DIADS uses a novel mix where 
machine learning provides the core diagnosis techniques 
while domain knowledge serves as checks-and-balances 
against spurious correlations. 

Diagnosis based on Machine Learning: PeerPressure 
[32] uses statistical techniques to develop models for a 
healthy machine, and uses these models to identify sick 
machines. Another proposed method [4] builds models 
from process performance counters in order to identify 
anomalous processes that cause computer slowdowns. 
There is also work on diagnosing problems in multi- 
tier Web applications using machine learning techniques. 
For example, modified Bayesian network models [8] and 
ensembles of probabilistic models [35] that capture sys- 
tem behavior under changing conditions have been used. 
These approaches treat data collected from each subsys- 
tem equally, in effect creating a single table of perfor- 
mance metrics that is input to machine learning modules. 
In contrast, DIADS adds more structure and semantics to 
the collected data, e.g., to better understand the impact 
of database operator performance vs. SAN volume per- 
formance. Furthermore, DIADS complements machine 
learning techniques with domain knowledge. 

Diagnosis based on Domain Knowledge: There are also 
many systems, especially in the DB community, where 
domain knowledge is used to create a symptoms database
that associates performance symptoms with underlying 
root causes [34, 19, 24, 10, 11]. Commercial vendors 
like EMC, IBM, and Oracle use symptom databases for 
problem diagnosis and correction. While these databases 
are created manually and require expertise and resources 
to maintain, recent work attempts to partially automate 
this process [9, 12]. 

We believe that a suitable mix of machine learning 
techniques and domain knowledge is required for a diag- 
nosis tool to be useful in practice. Pure machine learning 
techniques can be misled by spurious correlations in data 
resulting from noisy data collection or event propaga- 
tion (where a problem in one component impacts another 
component). Such effects need to be addressed using ap- 
propriate domain knowledge, e.g., component dependen- 
cies, symptoms databases, and knowledge of query plan 
and operator relationships. 

It is also important to differentiate DIADS from 
tracing-based techniques [7, 1] that trace messages 
through systems end-to-end to identify performance 
problems and failures. Such tracing techniques require 
changes in production system deployments and often add 
significant overhead in day-to-day operations. In con- 
trast, DIADS performs a postmortem analysis of moni- 
tored performance data collected at industry-standard in- 
tervals to identify performance problems. 

Next, we provide an overview of DIADS. 


3 Overview of DIADS 


Suppose a query Q that a report-generation application 
issues periodically to the database system shows a slow- 
down in performance. One approach to track down the 
cause is to leverage historic monitoring data collected 
from the entire system. There are several product of- 
ferings [13, 15, 16, 17, 31] in the market that collect and 
persist monitoring data from IT systems. 

DIADS uses a commercial storage management 
server—IBM TotalStorage Productivity Center [17]— 
that collects monitoring data from multiple layers of the 
IT stack including databases, servers, and the SAN. The 
collected data is transformed into a tabular format, and 
persisted as time-series data in a relational database. 


SAN-level data: The collected data includes: (i) con- 
figuration of components (both physical and logical), (ii) 
connectivity among components, (iii) changes in config- 
uration and connectivity information over time, (iv) per- 
formance metrics of components, (v) system-generated 
events (e.g., disk failure, RAID rebuild) and (vi) events 
generated by user-defined triggers [14] (e.g., degradation 
in volume performance, high workload on storage sub- 
system). 

Figure 3: DIADS’s diagnosis workflow

Database-level data: To execute a query, a database system
generates a plan that consists of operators selected
from a small, well-defined family of operators [14]. Let
us consider an example query Q:

SELECT Product.Category, SUM(Product.Sales)
FROM Product
WHERE Product.Price > 1000
GROUP BY Product.Category


Q asks for the total sales of products, priced above 1000, 
grouped per category. Figure 1 shows a plan P to exe- 
cute Q. P consists of four operators: an Index Scan of 
the index on the Price attribute, a Fetch to bring match- 
ing records from the Product table, a Sort to sort these 
records on Category values, and a Grouping to do the 
grouping and summation. For each execution of P, DI- 
ADS collects some monitoring data per operator O. The 
relevant data includes: O’s start time, stop time, and 
record-count (number of records returned in O’s output). 
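
The per-operator monitoring data described above can be pictured as a small record type. The following is a minimal sketch, not DIADS's actual schema: the class name, field names, and the sample numbers are all hypothetical, and the plan's total time is assumed to be the span from the earliest operator start to the latest operator stop.

```python
from dataclasses import dataclass


@dataclass
class OperatorSample:
    """Monitoring data for one operator in one run of plan P (hypothetical fields)."""
    operator: str
    start: float       # absolute start time, in seconds
    stop: float        # absolute stop time, in seconds
    record_count: int  # number of records in the operator's output

    @property
    def running_time(self) -> float:
        return self.stop - self.start


def run_record(samples):
    """Flatten one run into per-operator running times plus the plan's total time.

    Assumption: the plan's time is the span from the earliest operator start
    to the latest operator stop.
    """
    record = {"t(P)": max(s.stop for s in samples) - min(s.start for s in samples)}
    for s in samples:
        record[f"t({s.operator})"] = s.running_time
    return record


# Made-up numbers for the four operators of plan P.
samples = [
    OperatorSample("IndexScan", 0.0, 12.0, 5000),
    OperatorSample("Fetch", 2.0, 20.0, 5000),
    OperatorSample("Sort", 20.0, 26.0, 5000),
    OperatorSample("Grouping", 26.0, 27.0, 40),
]
print(run_record(samples))
```

Records of this shape, one per run, are the kind of input that the correlation analysis in Section 4.1 operates on.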


DIADS’s Diagnosis Interface: DIADS presents an inter- 
face where an administrator can mark a query as having 
experienced a slowdown. Furthermore, the administrator 
either specifies declaratively or marks directly the runs of 
the query that were satisfactory and those that were unsatisfactory.
For example, runs with running time below
100 seconds are satisfactory, or all runs between 8 AM 
and 2 PM were satisfactory, and those between 2 PM and 
3 PM were unsatisfactory. 
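
The two example rules above (a running-time threshold and time-of-day windows) could be expressed as simple labeling functions. This is only an illustrative sketch; the `QueryRun` type, the function names, and the sample runs are assumptions and do not describe the actual DIADS interface.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class QueryRun:
    start: datetime       # when the run started
    running_time: float   # total running time, in seconds


def label_by_threshold(runs, max_seconds=100.0):
    """Declarative rule: runs faster than max_seconds are satisfactory (S)."""
    return ["S" if r.running_time < max_seconds else "U" for r in runs]


def label_by_window(runs, good=(8, 14), bad=(14, 15)):
    """Time-window rule on the start hour; None if the run is in neither window."""
    labels = []
    for r in runs:
        hour = r.start.hour
        if good[0] <= hour < good[1]:
            labels.append("S")
        elif bad[0] <= hour < bad[1]:
            labels.append("U")
        else:
            labels.append(None)
    return labels


# Made-up runs: two in the 8 AM to 2 PM window, one between 2 PM and 3 PM.
runs = [
    QueryRun(datetime(2009, 1, 5, 9, 0), 62.0),
    QueryRun(datetime(2009, 1, 5, 13, 30), 71.5),
    QueryRun(datetime(2009, 1, 5, 14, 15), 148.0),
]
print(label_by_threshold(runs))
print(label_by_window(runs))
```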


Diagnosis Workflow: DIADS then invokes the workflow
shown in Figure 3 to diagnose the query slowdown based 
on the monitoring data collected for satisfactory and un- 
satisfactory runs. By default, the workflow is run in a 
batch mode. However, the administrator can choose to
run the workflow in an interactive mode where only one 
module is run at a time. After seeing the results of each 
module, the administrator can edit the data or results be- 
fore feeding them to the next module, bypass or reinvoke 
modules, or stop the workflow. Because of space con- 
straints, we will not discuss the interactive mode further 
in this paper. 

The first module in the workflow, called Module Plan- 
Diffing (PD), looks for significant changes between the 
plans used in satisfactory and unsatisfactory runs. If such 
changes exist, then DIADS tries to pinpoint the cause of 
the plan changes (which includes, e.g., index addition or 
dropping, changes in data properties, or changes in con- 
figuration parameters used during plan selection). The 
techniques used in this module contain details specific to 
databases, so they are covered in a companion paper [5]. 

The remaining modules are invoked if DIADS finds a 
plan P that is involved in both satisfactory and unsat- 
isfactory runs of the query. We give a brief overview 
before diving into the details in Section 4: 

• Module Correlated Operators (CO): DIADS finds
the (nonempty) subset of operators in P whose 
change in performance correlates with the query 
slowdown. The operators in this subset are called 
correlated operators. 

• Module Dependency Analysis (DA): Having identified
the correlated operators, DIADS uses a combination
of correlation analysis and the configuration and
connectivity information collected during monitoring 
to identify the components in the system whose per- 
formance is correlated with the performance of the 
correlated operators. 

• Module Correlated Record-counts (CR): Next,
DIADS checks whether the change in P’s perfor- 
mance is correlated with the record-counts of P’s op- 
erators. If significant correlations exist, then it means 
that data properties have changed between satisfac- 
tory and unsatisfactory runs of P. 

• Module Symptoms Database (SD): The correlations
identified so far are likely symptoms of the root
cause(s) of query slowdown. Other symptoms may 
be present in the stream of system-generated events 
and trigger-generated (user-defined) semantic events. 
The combination of these symptoms is used to probe 
a symptoms database that maps symptoms to the un- 
derlying root cause(s). The symptoms database im- 
proves diagnosis accuracy by dealing with the propa- 
gation of faults across components as well as missing 
symptoms, unexpected symptoms (e.g., spurious cor- 
relations), and multiple simultaneous problems. 

• Module Impact Analysis (IA): The symptoms
database computes a confidence score for each suspected
root cause. For each high-confidence root
cause R, DIADS performs impact analysis to answer
the following question: if R is really a cause of
the query slowdown, then what fraction of the query
slowdown can be attributed to R? To the best of our
knowledge, DIADS is the first automated diagnosis
tool to have an impact-analysis module. 
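
The control flow implied by this overview (Module PD first, the remaining modules only when some plan appears in both satisfactory and unsatisfactory runs) can be sketched as follows. The function names and the `(plan_id, label)` input shape are hypothetical; the sketch shows only the gating condition and module order, not the modules themselves.

```python
MODULE_ORDER = ["CO", "DA", "CR", "SD", "IA"]


def plans_in_both(runs):
    """runs: iterable of (plan_id, label) pairs, where label is "S" or "U"."""
    seen = {"S": set(), "U": set()}
    for plan_id, label in runs:
        seen[label].add(plan_id)
    return seen["S"] & seen["U"]


def modules_to_run(runs):
    """PD always runs; the rest run only if a plan appears under both labels."""
    shared = plans_in_both(runs)
    return ["PD"] + (MODULE_ORDER if shared else []), shared


# Made-up history: plan P1 was used in both kinds of runs, P2 only in bad runs.
history = [("P1", "S"), ("P1", "U"), ("P2", "U")]
print(modules_to_run(history))
```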


Integrated database/SAN diagnosis: Note that the 
workflow “drills down” progressively from the level of 
the query to plans and to operators, and then uses de- 
pendency analysis and the symptoms database to further 
drill down to the level of performance metrics and events 
in components. Finally, impact analysis is a “roll up” 
to tie potential root causes back to their impact on the 
query slowdown. The drill down and roll up are based
on a careful integration of information from the database
and SAN layers, and are not a simple concatenation of
database-only and SAN-only modules. Only low-overhead
monitoring data is used in the entire process.


Machine learning + domain knowledge: DIADS’s 
workflow is a novel combination of elements from ma- 
chine learning with the use of domain knowledge. A 
number of modules in the workflow use correlation anal- 
ysis which is implemented using machine learning; the 
details are in Sections 4.1 and 4.2. Domain knowledge is 
incorporated into the workflow in Modules DA, SD, and 
IA; the details are given respectively in Sections 4.2-4.4. 
(Domain knowledge is also used in Module PD which is 
beyond the scope of this paper.) As we will demonstrate, 
the combination of machine learning and domain knowl- 
edge provides built-in checks and balances to deal with 
the challenges listed in Section 1. 


4 Modules in the Workflow 


We now provide details for all modules in DIADS’s diag- 
nosis workflow. Upfront, we would like to point out that 
our main goal is to describe an end-to-end instantiation 
of the workflow. We expect that the specific implemen- 
tation techniques used for the modules will change with 
time as we gain more experience with DIADS. 


4.1 Identifying Correlated Operators 


Objective: Given a plan P that is involved in both satisfactory
and unsatisfactory runs of the query, DIADS’s
objective in this module is to find the set of correlated
operators. Let $O_1, O_2, \ldots, O_n$ be the set of all operators
in P. The correlated operators form the subset of
$O_1, \ldots, O_n$ whose change in running time best explains
the change in P’s running time (i.e., P’s slowdown).


Technique: DIADS identifies the correlated operators
by analyzing the monitoring data collected
during satisfactory and unsatisfactory runs of P.
This data can be seen as records with attributes
$A, t(P), t(O_1), t(O_2), \ldots, t(O_n)$ for each run of P.
Here, attribute $t(P)$ is the total time for one complete run
of P, and attribute $t(O_i)$ is the running time of operator
$O_i$ for that run. Attribute A is an annotation (or label)
associated with each record that represents whether the
corresponding run of P was satisfactory or not. Thus,
A takes one of two values: satisfactory (denoted S) or
unsatisfactory (denoted U).

Let the values of attribute $t(O_i)$ in records with annotation
S be $s_1, s_2, \ldots, s_k$, and those with annotation
U be $u_1, u_2, \ldots, u_l$. That is, $s_1, \ldots, s_k$ are k observations
of the running time of operator $O_i$ when the plan P
ran satisfactorily. Similarly, $u_1, u_2, \ldots, u_l$ are l observations
of the running time of $O_i$ when the running time of
P was unsatisfactory. DIADS pinpoints correlated operators
by characterizing how the distribution of $s_1, \ldots, s_k$
differs from that of $u_1, \ldots, u_l$. For this purpose, DIADS
uses Kernel Density Estimation (KDE) [22].

KDE is a non-parametric technique to estimate the
probability density function of a random variable. Let $S_i$
be the random variable that represents the running time
of operator $O_i$ when the overall plan performance is satisfactory.
KDE applies a kernel density estimator to the
k observations $s_1, \ldots, s_k$ of $S_i$ to learn $S_i$’s probability
density function $f_i(S_i)$:

$$f_i(S_i) = \frac{1}{kh} \sum_{j=1}^{k} K\left(\frac{S_i - s_j}{h}\right) \qquad (1)$$

Here, K is a kernel function and h is a smoothing parameter.
A typical kernel is the standard Gaussian function
$K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. (Intuitively, kernel density estimators are
a generalization and improvement over histograms.)

Let u be an observation of operator $O_i$’s running time
when the plan performance was unsatisfactory. Consider
the probability estimate $\mathrm{prob}(S_i < u) = \int_{-\infty}^{u} f_i(s_i)\, ds_i$.
Intuitively, as u becomes higher than the typical range of
values of $S_i$, $\mathrm{prob}(S_i < u)$ becomes closer to 1. Thus,
a high value of $\mathrm{prob}(S_i < u)$ represents a significant
increase in the running time of operator $O_i$ when plan
performance was unsatisfactory compared to that when
plan performance was satisfactory.

Specifically, DIADS includes $O_i$ in the set of correlated
operators if $\mathrm{prob}(S_i < \bar{u}) > 1 - \alpha$. Here, $\bar{u}$ is the
average of $u_1, \ldots, u_l$ and $\alpha$ is a small positive constant;
$\alpha = 0.1$ by default. For obvious reasons, $\mathrm{prob}(S_i < \bar{u})$
is called the anomaly score of operator $O_i$.
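
The anomaly-score test of this section can be illustrated with a few lines of standard-library Python. This is a sketch under stated assumptions, not the DIADS implementation: it uses the Gaussian kernel, picks the smoothing parameter h with Silverman's rule of thumb (the text does not say how h is chosen), and runs on made-up running times. With a Gaussian kernel, integrating the density estimate up to a point turns each kernel term into a Gaussian CDF, which is why `math.erf` suffices.

```python
import math
from statistics import mean, stdev


def gaussian_cdf(x):
    """CDF of the standard Gaussian kernel."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def silverman_bandwidth(samples):
    """Rule-of-thumb smoothing parameter h (an assumption; the text does not fix h)."""
    k = len(samples)
    h = 1.06 * stdev(samples) * k ** -0.2
    return max(h, 1e-9)  # guard against identical samples


def anomaly_score(satisfactory, unsatisfactory, h=None):
    """prob(S_i < u_bar) under the KDE fitted to the satisfactory running times."""
    if h is None:
        h = silverman_bandwidth(satisfactory)
    u_bar = mean(unsatisfactory)
    # Integrating the kernel density estimate from -inf to u_bar
    # replaces each kernel by its CDF.
    return sum(gaussian_cdf((u_bar - s) / h) for s in satisfactory) / len(satisfactory)


def correlated_operators(times_s, times_u, alpha=0.1):
    """Operators whose anomaly score exceeds 1 - alpha."""
    return [op for op in times_s
            if anomaly_score(times_s[op], times_u[op]) > 1 - alpha]


# Made-up running times (seconds) for two operators.
times_s = {"IndexScan": [10.0, 11.0, 9.5, 10.5], "Sort": [4.0, 4.2, 3.9, 4.1]}
times_u = {"IndexScan": [25.0, 27.0, 26.0], "Sort": [4.0, 4.1, 4.05]}
print(correlated_operators(times_s, times_u))
```

In this toy data only the Index Scan slows down in unsatisfactory runs, so it is the only operator flagged; the Sort operator's score stays near 0.5 because its unsatisfactory running times sit in the middle of its satisfactory range.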


4.2 Dependency Analysis 


Objective: This module takes the set of correlated op- 
erators as input, and finds the set of system components 
that show a change in performance correlating with the 
change in running time of one or more correlated
operators.


Technique: DIADS implements this module using de- 
pendency analysis which is based on generating and 
pruning dependency paths for the correlated operators.
We describe the generation and pruning of dependency 
paths in turn. 


Generating dependency paths: The dependency path of
an operator $O_i$ is the set of physical (e.g., server CPU,
database buffer cache, disk) and logical (e.g., volume,
external workload) components in the system whose performance
can have an impact on $O_i$’s performance. DIADS
generates dependency paths automatically based on
the following data:


• System-wide configuration and connectivity data as
well as updates to this data collected during the exe- 
cution of each operator (recall Section 3). 

• Domain knowledge of how each database operator
executes. For example, the dependency path of a sort 
operator that creates temporary tables on disk will be 
different from one that does not create temporaries. 

We distinguish between inner and outer dependency
paths. The performance of components in $O_i$’s inner
dependency path can affect $O_i$’s performance directly.
$O_i$’s outer dependency path consists of components that
affect $O_i$’s performance indirectly by affecting the performance
of components on the inner dependency path.
As an example, the inner dependency path for the Index 
Scan operator in Figure 1 includes the server, HBA, FC- 
Switches, Pool2, Volume v2, and Disks 5-8. The outer 
dependency path will include Volumes v1 and v3 (be- 
cause of the shared disks) and other database queries. 
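
To make the inner/outer distinction concrete, the sketch below derives both paths from a toy configuration. Everything here is hypothetical: the dictionaries stand in for the configuration and connectivity data, the disk assignments of Volumes v1 and v3 are invented so that they share disks with v2, and other database queries (also part of the outer path) are left out for brevity.

```python
# Hypothetical configuration data, loosely modeled on the Index Scan example.
IO_ROUTE = ["server", "HBA", "FC-Switches"]
VOLUME_POOL = {"v1": "Pool2", "v2": "Pool2", "v3": "Pool2", "v4": "Pool1"}
VOLUME_DISKS = {
    "v1": {"disk5", "disk6"},                    # assumed layout
    "v2": {"disk5", "disk6", "disk7", "disk8"},
    "v3": {"disk7", "disk8"},                    # assumed layout
    "v4": {"disk1", "disk2"},                    # shares nothing with v2
}


def inner_path(volume):
    """Components that directly serve I/O for an operator reading `volume`."""
    return IO_ROUTE + [VOLUME_POOL[volume], volume] + sorted(VOLUME_DISKS[volume])


def outer_path(volume):
    """Other volumes that share at least one disk with `volume`."""
    disks = VOLUME_DISKS[volume]
    return sorted(v for v, d in VOLUME_DISKS.items() if v != volume and d & disks)


print(inner_path("v2"))
print(outer_path("v2"))
```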


Pruning dependency paths: The fact that a component C
is in the dependency path of an operator $O_i$ does not necessarily
mean that $O_i$’s performance has been affected by
C’s performance. After generating the dependency paths
conservatively, DIADS prunes these paths based on correlation
analysis using KDE.

Recall from Section 3 that the monitoring data collected
by DIADS contains multiple observations of the
running time of operator $O_i$ both when the overall plan
ran satisfactorily and when the plan ran unsatisfactorily.
For each run of $O_i$, consider the performance data
collected by DIADS for each component C in $O_i$’s dependency
path; this data is collected in the $[t_{\mathrm{start}}, t_{\mathrm{stop}}]$ time
interval where $t_{\mathrm{start}}$ and $t_{\mathrm{stop}}$ are respectively $O_i$’s (absolute)
start and stop times for that run. Across all runs,
this data can be represented as a table with attributes
$A, t(O_i), m_1, \ldots, m_p$. Here, $m_1, \ldots, m_p$ are performance
metrics of component C, and the annotation attribute A
represents whether $O_i$’s running time $t(O_i)$ was satisfactory
or not in the corresponding run. It follows from
Section 4.1 that we can set A’s value in a record to U
(denoting unsatisfactory) if $\mathrm{prob}(S_i < t(O_i)) > 1 - \alpha$,
and to S otherwise.

Given the above annotated performance data for an
$(O_i, C)$ operator-component pairing, we can apply correlation
analysis using KDE to identify C’s performance
metrics that are correlated with the change in $O_i$’s performance.
The details are similar to those in Section 4.1
except for the following: for some performance metrics,
observed values lower than the typical range are anomalous.
This correlation can be captured using the condition
$\mathrm{prob}(M < v) < \alpha$, where M is the random variable
corresponding to the metric, v is a value observed for M,
and $\alpha$ is a small positive constant.

Figure 4: Example Codebook

In effect, the dependency analysis module will identify
the set of components that: (i) are part of $O_i$’s dependency
path, and (ii) have at least one performance
metric that is correlated with the running time of a correlated
operator $O_i$. By default, DIADS will only consider
the components in the inner dependency paths of
correlated operators. However, components in the outer 
dependency paths will be considered if required by the 
symptoms database (Module SD). 
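
The pruning step can be sketched in the same style as the test in Section 4.1, with one extra flag per metric for the case where unusually low values are the anomaly. This is an illustration under assumptions, not the DIADS code: the bandwidth again comes from Silverman's rule of thumb, the observed value v is taken to be the mean of the metric over unsatisfactory runs (mirroring Section 4.1), and the component names, metric names, and numbers are invented.

```python
import math
from statistics import mean, stdev


def kde_cdf(samples, v, h):
    """prob(M < v) under a Gaussian KDE fitted to `samples`."""
    return sum(0.5 * (1 + math.erf((v - s) / (h * math.sqrt(2))))
               for s in samples) / len(samples)


def is_correlated(metric_s, metric_u, low_is_bad=False, alpha=0.1):
    """Compare the metric's mean in unsatisfactory runs with its satisfactory distribution."""
    k = len(metric_s)
    h = max(1.06 * stdev(metric_s) * k ** -0.2, 1e-9)  # Silverman's rule (assumption)
    p = kde_cdf(metric_s, mean(metric_u), h)
    return p < alpha if low_is_bad else p > 1 - alpha


def prune(path_metrics, alpha=0.1):
    """Keep only components with at least one correlated metric."""
    return [component for component, metrics in path_metrics.items()
            if any(is_correlated(s, u, low, alpha) for s, u, low in metrics.values())]


# Made-up data: metric -> (values in S runs, values in U runs, low_is_bad).
path_metrics = {
    "Volume v2": {
        "read_latency_ms": ([5.0, 5.5, 4.8, 5.2], [14.0, 15.5, 13.8], False),
        "throughput_mbps": ([80, 82, 79, 81], [35, 33, 36], True),
    },
    "HBA": {
        "utilization_pct": ([40, 42, 41, 39], [41, 40, 42], False),
    },
}
print(prune(path_metrics))
```

Here the HBA is pruned because its utilization looks the same in both kinds of runs, while Volume v2 is kept because both of its metrics shift in their anomalous directions.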

Recall Module CR in the diagnosis workflow where 
DIADS checks for significant correlation between plan 
P’s running time and the record counts of P’s operators. 
DIADS implements this module using KDE in a manner
very similar to the use of KDE in dependency analysis;
hence Module CR is not discussed further. 


4.3 Symptoms Database 


The modules so far in the workflow drilled down from 
the level of the query to that of physical and logical com- 
ponents in the system; in the process identifying corre- 
lated operators and performance metrics. While this in- 
formation is useful, the detected correlations may only 
be symptoms of the true root cause(s) of the query slow- 
down. This issue, which can mask the true root cause(s), 
is generally referred to as the event (fault) propagation 
problem in diagnosis. For example, a change in data 
properties at the database level may, in turn, propagate 
to the volume level causing volume contention, and to 
the server level increasing CPU utilization. In addition, 
some spurious correlations may creep in and manifest 
themselves as unexpected symptoms in spite of our care- 
ful drill down process. 

Objective: DIADS’s Module SD tries to map the ob- 
served symptoms to the actual root cause(s), while deal- 
ing with missing as well as unexpected symptoms arising 
from the noisy nature of production systems. 
Technique: DIADS uses a symptoms database to do the 
mapping. This database streamlines the use of domain 
knowledge in the diagnosis workflow to: 


• Generate more accurate diagnosis results by dealing
with event propagation.

• Generate diagnosis results that are semantically more
meaningful to administrators (for example, reporting
lock contention as the root cause instead of reporting
some correlated metrics only).

We considered a number of formats proposed previously
in the literature to input domain knowledge for aiding
diagnosis. Our evaluation criteria were the following:

I. How easy is the format for administrators to use? 
Here, usage includes customization, maintenance 
over time, as well as debugging. When a diagnosis 
tool pinpoints a particular cause, it is important that 
the administrators are able to understand and validate 
the tool’s reasoning. Otherwise, administrators may 
never trust the tool enough to use it. 

II. Can the format deal with the noisy conditions in
production systems, including multiple simultaneous
problems, presence of spurious correlations, and
missing symptoms?

One of the formats from the literature [16] is an expert
knowledge-base of rules where each rule expresses patterns
or relationships that describe symptoms, and can be
matched against the monitoring data. Most of the focus
in this work has been on exact matches, so this format
scores poorly on Criterion II. Representing relationships
among symptoms (e.g., event X will cause event Y) using
deterministic or probabilistic networks like Bayesian
networks [23] has been gaining currency recently. This
format has high expressive power, but remains a black box
for administrators who find it hard to interpret the
reasoning process (Criterion I).

Another format, called the Codebook [34], is very intuitive
and is implemented in a commercial product. This format
assumes a finite set of symptoms such that each distinct
root cause R has a unique signature in this set. That is,
there is a unique subset of symptoms that R gives rise to,
which makes it distinguishable from all other root causes.
This information is represented in the Codebook, which is a
matrix whose columns correspond to the symptoms and rows
correspond to the root causes. A cell is mapped to 1 if the
corresponding root cause should show the corresponding
symptom, and to 0 otherwise. Figure 4 shows an example
Codebook where there are four hypothetical symptoms
symp1-symp4, and three root causes R1-R3.

When presented with a vector V of symptoms seen in
the system, the Codebook computes the distance d(V, R)
of V to each row R (i.e., root cause). Any number of
different distance metrics can be used, e.g., Euclidean (L2)
distance or Hamming distance [34]. d(V, R) is a measure
of the confidence that R is a root cause of the problem:
the smaller the distance, the better the match. For
example, given a symptoms vector (1, 0, 0, 1) (i.e., only
symp1 and symp4 are seen), the Euclidean distances to the
three root causes in Figure 4 are 0, √2, and 1
respectively. Hence, R1 is the best match.
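
The Codebook lookup can be sketched in a few lines. The matrix below is hypothetical: its rows were chosen only so that the symptoms vector (1, 0, 0, 1) reproduces Euclidean distances of 0, √2, and 1, and they are not copied from Figure 4. The function names are likewise invented for this sketch.

```python
import math

# Hypothetical Codebook: one row per root cause, columns are symp1..symp4.
CODEBOOK = {
    "R1": (1, 0, 0, 1),
    "R2": (1, 1, 1, 1),
    "R3": (1, 0, 1, 1),
}


def euclidean(v, row):
    """Euclidean (L2) distance between a symptoms vector and a Codebook row."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, row)))


def hamming(v, row):
    """Number of positions where the symptoms vector and the row differ."""
    return sum(a != b for a, b in zip(v, row))


def rank_root_causes(v, codebook, distance=euclidean):
    """Return (distance, root cause) pairs, closest match first."""
    return sorted((distance(v, row), name) for name, row in codebook.items())


ranking = rank_root_causes((1, 0, 0, 1), CODEBOOK)
best_distance, best_cause = ranking[0]
```

Passing distance=hamming swaps the metric without changing the ranking logic.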

The Codebook format does well on both our evalua- 
tion criteria. Codebooks can handle noisy situations, and 
administrators can easily validate the reasoning process. 
However, DIADS needs to consider complex symptoms 
such as symptoms with temporal properties. For exam- 
ple, we may need to specify a symptom where a disk fail- 
ure is seen within X minutes of the first incidence of the 
query slowdown, where X may vary depending on the 
installation. Thus, it is almost impossible in our domain 
to fully enumerate a closed space of relevant symptoms, 
and to specify for each root cause whether each symptom 
from this space will be seen or not. These observations 
led to DIADS’s new design of the symptoms database: 


1. We define a base set of symptoms consisting of: 
(i) operators in the database system that can be in- 
cluded in the correlated set, (ii) performance met- 
rics of components that can be correlated with op- 
erator performance, and (iii) system-monitored and 
user-defined events collected by DIADS. 


2. The language defined by IBM’s Active Correlation 
Technology (ACT) is used to express complex symp- 
toms over the base set of symptoms [2]. The benefit 
of this language comes from its support for a range 
of built-in patterns including filter, collection, dupli- 
cate, computation, threshold, sequence, and timer. 
ACT can express symptoms like: (i) the workload
on a volume is higher than 200 IOPS, and (ii) event
E1 should follow event E2 in the 30 minutes preceding
the first instance of query slowdown.


3. DIADS’s symptoms database is a collection of root
cause entries each of which has the format Cond_1
& Cond_2 & ... & Cond_z, for some z > 0 which
can differ across entries. Each Cond_i is a Boolean
condition of the form ∃symp_j (denoting presence of
symp_j) or ¬∃symp_j (denoting absence of symp_j).
Here, symp_j is some base or complex symptom.
Each Cond_i is associated with a weight w_i such that
the sum of the weights for each individual root cause
entry is 100%. That is, $\sum_{i=1}^{z} w_i = 100\%$.








4. Given a vector of base symptoms, DIADS computes
a confidence score for each root cause entry R as the
sum of the weights of R’s conditions that evaluate
to true. Thus, the confidence score for R is a value
in [0%, 100%] equal to $\sum_{i=1}^{z} w_i \mid \mathrm{Cond}_i = \mathrm{true}$.
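
The scoring rule in item 4 can be sketched as follows. The Cond class, the symptom names, and the weights are invented for illustration, and the sketch assumes that complex symptoms have already been evaluated into the set of observed symptoms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cond:
    symptom: str      # name of a base or complex symptom
    present: bool     # True: symptom must be seen; False: must be absent
    weight: float     # weight in percent


def confidence_score(entry, observed):
    """Sum the weights of the conditions in `entry` that hold for `observed`."""
    total = sum(c.weight for c in entry)
    if abs(total - 100.0) > 1e-9:
        raise ValueError("weights of a root cause entry must sum to 100%")
    return sum(c.weight for c in entry if (c.symptom in observed) == c.present)


# Hypothetical entry; symptom names and weights are invented for illustration.
san_misconfiguration = [
    Cond("new_volume_created", True, 40.0),
    Cond("new_zoning_created", True, 30.0),
    Cond("v1_metrics_anomalous", True, 20.0),
    Cond("raid_rebuild_event", False, 10.0),
]
observed = {"new_volume_created", "v1_metrics_anomalous"}
score = confidence_score(san_misconfiguration, observed)  # 40 + 20 + 10
```

Because an absence condition contributes its weight when the symptom is not observed, a missing symptom lowers the score without ruling the entry out.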


DIADS’s symptoms database tries to balance the expres- 
sive power of rules with the intuitive structure and robust- 
ness of Codebooks. The symptoms database differs from 
conventional Codebooks in a number of ways. For each
root cause entry, DIADS avoids the “closed-world”
assumption for symptoms by mapping symptoms to 0, 1, or
“don’t care”. Conventional Codebooks are constrained to
0 or 1 mappings. DIADS’s symptoms database can contain
mappings for fixes to problems in addition to root
causes. This feature is useful because it may be easier 
to specify a fix for a query slowdown (e.g., add an in- 
dex) instead of trying to find the root cause. DIADS also 
allows multiple distinct entries for the same root cause. 
Generation of the symptoms database: Companies 
like EMC, IBM, HP, and Oracle are investing signifi- 
cant (currently, mostly manual) effort to create symp- 
toms databases for different subsystems like network- 
ing infrastructure, application servers, and databases 
[34, 19, 24, 9, 10, 11]. Symptoms databases created by 
some of these efforts are already in commercial use. The 
creation of these databases can be partially automated, 
e.g., through a combination of fault injection and ma- 
chine learning [9, 12]. In fact, DIADS’s modules like 
correlation, dependency, and impact analysis can be used 
to identify important symptoms automatically. 


4.4 Impact Analysis 


Objective: The confidence score computed by the symptoms
database module for a potential root cause R captures
how well the symptoms seen in the system match
the expected symptoms of R. For each root cause R
whose confidence score exceeds a threshold, the impact
analysis module computes R’s impact score. If R is an
actual root cause, then R’s impact score represents the
fraction of the query slowdown that can be attributed to
R individually. DIADS’s novel impact analysis module
serves three significant purposes:

• When multiple problems coexist in the system, impact
analysis can separate out high-impact causes from the
less significant ones, enabling prioritization of
administrator effort in problem solving.

• As a safeguard against misdiagnoses caused by spurious
correlations due to noise.

• As an extra check to find whether we have identified
the right cause(s) or all cause(s).

Technique: Interestingly, one approach for impact analysis
is to invert the process of dependency analysis from
Section 4.2. Let R be a potential root cause whose impact
score needs to be estimated:

1. Identify the set of components, denoted comp(R), 
that R affects in the inner dependency path of the 
operators in the query plan. DIADS gets this infor- 
mation from the symptoms database. 

2. For each component C ∈ comp(R), find the subset
of correlated operators, denoted op(R), such that
for each operator O in this subset: (i) C is in O’s
inner dependency path, and (ii) at least one performance
metric of C is correlated with the change in
O’s performance. DIADS has already computed this
information in the dependency analysis module.

3. R’s impact score is the percentage of the change in 
plan running time (query slowdown) that can be at- 
tributed to the change in running time of operators 
in op(R). Here, change in running time is computed 
as the difference between the average running times 
when performance is unsatisfactory and that when 
performance is satisfactory. 
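
Step 3 can be sketched as below. The sketch assumes that operator running times are exclusive of their children, so that per-operator changes can simply be added; this assumption, the function names, and all sample values are hypothetical and not taken from the paper.

```python
from statistics import mean


def running_time_change(satisfactory, unsatisfactory):
    """Average unsatisfactory running time minus average satisfactory one."""
    return mean(unsatisfactory) - mean(satisfactory)


def impact_score(op_r, operator_times, plan_times):
    """Percentage of the plan slowdown attributed to the operators in op(R).

    operator_times maps an operator name to a pair
    (satisfactory samples, unsatisfactory samples); plan_times is the same
    pair for the whole plan.
    """
    plan_change = running_time_change(*plan_times)
    if plan_change <= 0:
        raise ValueError("plan shows no slowdown")
    op_change = sum(running_time_change(*operator_times[o]) for o in op_r)
    return 100.0 * op_change / plan_change


# Hypothetical running times in seconds.
operator_times = {
    "O8": ([50.0, 50.0, 50.0], [250.0, 250.0, 250.0]),
    "O22": ([40.0, 40.0, 40.0], [190.0, 190.0, 190.0]),
    "O4": ([30.0, 30.0, 30.0], [31.0, 31.0, 31.0]),
}
plan_times = ([240.0, 250.0, 230.0], [600.0, 610.0, 590.0])
score = impact_score({"O8", "O22"}, operator_times, plan_times)
```

With these numbers the operators in op(R) account for 350 of the 360 seconds of slowdown, so the score is a little over 97%.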

The above approach will work as long as for any pair of
suspected root causes R1 and R2, op(R1) ∩ op(R2) = ∅.
However, if there are one or more operators common to
op(R1) and op(R2) whose running times have changed
significantly, then the above approach cannot fully
separate out the individual impacts of R1 and R2.
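
A simple way to detect when this condition fails is to intersect the operator sets of every pair of suspected root causes. The sketch below does only that; it does not check whether the shared operators' running times changed significantly, and the function name and example sets are hypothetical.

```python
from itertools import combinations


def overlapping_pairs(op_sets):
    """Map each pair of root causes to the operators their op() sets share.

    Pairs whose operator sets are disjoint are omitted from the result.
    """
    overlaps = {}
    for ra, rb in combinations(sorted(op_sets), 2):
        shared = set(op_sets[ra]) & set(op_sets[rb])
        if shared:
            overlaps[(ra, rb)] = shared
    return overlaps


# Hypothetical op(R) sets for three suspected root causes.
op_sets = {
    "R1": {"O8", "O22"},
    "R2": {"O4"},
    "R3": {"O22", "O4"},
}
conflicts = overlapping_pairs(op_sets)
```

An empty result means the inverse dependency analysis can attribute the slowdown to each root cause separately.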

DIADS addresses the above problem by leveraging 
plan cost models that play a critical role in all database 
systems. For each query submitted to a database system, 
the system will consider a number of different plans, use 
the plan cost model to predict the running time (or some 
other cost metric) of each plan, and then select the plan 
with minimum predicted running time to run the query 
to completion. These cost models have two main com- 
ponents: 

• Analytical formula per operator type (e.g., sort, index
scan) that estimates the resource usage (e.g., CPU
and I/O) of the operator based on the values of input
parameters. While the number and types of input parameters
depend on the operator type, the main ones
are the sizes of the input processed by the operator.

• Mapping parameters that convert resource-usage estimates
into running-time estimates. For example,
IBM DB2 uses two such parameters to convert the
number of estimated I/Os into a running-time estimate:
(i) the overhead per I/O operation, and (ii) the
transfer rate of the underlying storage device.

The following are two examples of how DIADS uses plan
cost models:

• Since DIADS collects the old and new record-counts
for each operator, it estimates the impact score of
a change in data properties by plugging the new
record-counts into the plan cost model.

• When volume contention is caused by an external
workload, DIADS estimates the new I/O latency of
the volume from actual observations or the use of
device performance models. The impact score of the
volume contention is computed by plugging this new
estimate into the plan cost model.
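
To make the first example concrete, the sketch below uses a deliberately simplified cost model for a single scan operator: one I/O per page, each costing a fixed overhead plus a per-page transfer time. The linear form, the parameter names, and every number are assumptions for illustration; real optimizer cost models, including DB2's, are considerably more detailed.

```python
import math


def scan_time_ms(records, records_per_page, overhead_ms, transfer_ms_per_page):
    """Toy cost model: one I/O per page, each costing overhead plus transfer."""
    pages = math.ceil(records / records_per_page)
    return pages * (overhead_ms + transfer_ms_per_page)


def data_change_impact(old_records, new_records, observed_slowdown_ms, **model):
    """Percentage of the observed slowdown predicted by the record-count change."""
    predicted = scan_time_ms(new_records, **model) - scan_time_ms(old_records, **model)
    return 100.0 * predicted / observed_slowdown_ms


# Hypothetical record counts and model parameters.
impact = data_change_impact(
    old_records=10_000,
    new_records=30_000,
    observed_slowdown_ms=2_000.0,
    records_per_page=100,
    overhead_ms=5.0,
    transfer_ms_per_page=0.5,
)
```

Here the model predicts 1,100 ms of extra scan time against an observed slowdown of 2,000 ms, i.e., an impact of 55%. The second example would follow the same pattern, with a higher per-I/O latency plugged in instead of new record counts.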

DIADS’s use of plan cost models is a general technique
for impact analysis, but it is limited by what effects are
accounted for in the model. For example, if wait times
for locks are not modeled, then the impact score cannot
be computed for locking-based problems. Addressing
this issue (e.g., by extending plan cost models or by
using planned experiments at run time) is an interesting
avenue for future work.


5 Experimental Evaluation 


The taxonomy of scenarios considered for diagnosis in 
the evaluation follows from Figure 2. DIADS was used 
to diagnose query slowdowns caused by (i) events within 
the database and the SAN layers, (ii) combinations of 
events across both layers, as well as (iii) multiple con- 
current problems (a capability unique to DIADS). Due to 
space limitations, it is not possible to describe all the sce- 
nario permutations from Figure 2. Instead, we start with 
a scenario and make it increasingly complex by combin- 
ing events across the database and SAN. We consider: 
(i) volume contention caused by SAN misconfiguration,
(ii) database-level problems (change in data properties, 
contention due to table locking) whose symptoms prop- 
agate to the SAN, and (iii) independent and concurrent 
database-level and SAN-level problems. 

We provide insights into how DIADS diagnoses these 
problems by drilling down to the intermediate results like 
anomaly, confidence, and impact scores. While there is 
no equivalent tool available for comparison with DIADS, 
we provide insights on the results that a database-only 
or SAN-only tool would have generated; these insights 
are derived from hands-on experience with multiple in- 
house and commercial tools used by administrators to- 
day. Within the context of the scenarios, we also report 
sensitivity analysis of the anomaly score to the number 
of historic samples and length of the monitoring interval. 


5.1 Setup Details 


Our experimental testbed is part of a production SAN 
environment, with the interconnecting fabric and stor- 
age controllers being shared by other applications. Our 
experiments ran during low activity time-periods on 
the production environment. The testbed runs data- 
warehousing queries from the popular TPC-H bench- 
mark [29] on a PostgreSQL database server configured to 
access tables using two Ext3 filesystem volumes created 
on an enterprise-class IBM DS6000 storage controller. 
The database server is a 2-way 1.7 GHz IBM xSeries 
machine running Linux (Redhat 4.0 Server), connected 
to the storage controller via Fibre Channel (FC) host bus 
adaptor (HBA). Both the storage volumes are RAID 5 
configurations consisting of (4 + 2P) 15K FC disks. 

An IBM TotalStorage Productivity Center [17] SAN 
management server runs on a separate machine record- 
ing configuration details, statistics, and events from the 
SAN as well as from PostgreSQL (which was instru- 
mented to report the data to the management tool). Fig- 
ure 6 shows the key performance metrics collected from 
the database and SAN. The monitoring data is stored as 
time-series data in a DB2 database. Each module in
DIADS’s workflow is implemented using a combination of
Matlab scripts (for KDE) and Java. DIADS uses a symptoms
database that was developed in-house to diagnose
query slowdowns in database over SAN deployments.

Our experimental results focus on the slowdown of the 
plan shown in Figure 5 for Query 2 from TPC-H. Figure 5
shows the 25 operators in the plan, denoted O1-O25.
In database terminology, the operators Index Scan
and Sequential Scan are leaf operators since they access 
data directly from the tables; hence the leaf operators are 
the most sensitive to changes in SAN performance. The 
plan has 9 leaf operators. The other operators process 
intermediate results. 


5.2 Scenario 1: Volume Contention due to 
SAN Misconfiguration 


Problem Setting 


In this scenario, contention is created in volume V1
(from Figure 5) causing a slowdown in query perfor- 
mance. The root cause of the contention is another ap- 
plication workload that is configured in the SAN to use 
a volume V’ that gets mapped to the same physical disks 
as V1. For an accurate diagnosis result, DIADS needs 
to pinpoint the combination of SAN configuration events 
generated on: (i) creation of the new volume V’, and (ii) 
creation of a new zoning and mapping relationship of the 
server running the workload that accesses V’. 


Module CO 


DIADS analyzes the historic monitoring samples col- 
lected for each of the 25 query operators. The moni- 
toring samples for an operator are labeled as satisfactory 
or unsatisfactory based on past problem reports from the 
administrator. Using the operator running times in these 
labeled samples, Module CO in the workflow uses KDE 
to compute anomaly scores for the operators (recall Sec- 
tion 4.1). Table 1 shows the anomaly scores of the oper- 
ators identified as the correlated operators; these opera- 
tors have anomaly scores > 0.8 (the significance of the 
anomaly scores is covered in Section 4.1). The following 
observations can be made from Table 1: 

• Leaf operators O8 and O22 were correctly identified
as correlated. These two are the only leaf operators
that access data on the Volume V1 under contention.

• Eight intermediate operators were ranked highly as
well. This ranking can be explained by event propagation,
where the running times of these operators
are affected by the running times of the “upstream”
operators in the plan (in this case O8 and O22).

• Leaf operator O4, which operates on tables in Volume
V2, is a false positive. This could be a result of
noisy monitoring data associated with the operator.

In summary, Module CO’s KDE analysis has zero false
negatives and one false positive from the total set of 9
leaf operators. The false positive gets filtered out later in
the symptoms database and impact analysis modules.

Figure 5: Query plan, operators, and dependency paths
for the experimental results

Operator   Operator Type            Anomaly Score
O2         Non-leaf                 1.00
O3         Non-leaf                 1.00
O6         Non-leaf                 1.00
O7         Non-leaf                 1.00
O8         Leaf (sequential scan)   1.00
O18        Non-leaf                 1.00
O20        Non-leaf                 1.00
O21        Non-leaf                 1.00
O22        Leaf (index scan)        1.00
O17        Non-leaf                 0.969
O4         Leaf (index scan)        0.965

Table 1: Anomaly scores for query operators from Figure
5 in Scenario 1

To further understand the anomaly scores, we conducted
a series of sensitivity tests. Figure 7 shows the
sensitivity of the anomaly scores of three representative
operators to the number of samples available from the
satisfactory runs. O22’s score converges quickly to 1
because O22’s running time under volume contention is
almost 5X the normal. However, the scores for leaf
operator O11 and intermediate operator O1 take around 20
samples to converge. With fewer samples than this,
O11 could have become a false positive. In all our
results, the anomaly scores of all 25 operators converge
within 20 samples. While more samples may be required
in environments with higher noise levels, the relative
simplicity of KDE (compared to models like Bayesian
networks) keeps this number low.

Figure 6: Important performance metrics collected by DIADS

Figure 7: Sensitivity of anomaly scores to the number of
satisfactory samples. While O22 shows highly anomalous
behavior, scores for O1 and O11 should be low

Figure 8 shows the sensitivity of O22’s anomaly score
to the length of the monitoring interval during a 4-hour
period. Intuitively, larger monitoring intervals suppress
the effect of spikes and bursty access patterns. In our
experiments, the query running time was around 4 minutes
under satisfactory conditions. Thus, monitoring intervals
of 10 minutes and larger in Figure 8 cause the anomaly
score to deviate more and more from the true value.

Figure 8: Sensitivity of anomaly scores to noise in the
monitoring data

Volume, Perf. Metric   Anomaly Score             Anomaly Score
                       (no contention in V2)     (contention in V2)
V1, writeIO            0.894                     0.894
V1, writeTime          0.823                     0.823
V2, writeIO            0.063                     0.512
V2, writeTime          0.479                     0.879

Table 2: Anomaly scores computed during dependency
analysis for performance metrics from Volumes V1, V2


Module DA


This module generates and prunes dependency paths for
correlated operators in order to relate operator
performance to database and SAN component performance.
For ease of presentation, we will focus on the leaf opera- 
tors in Figure 5 since they are the most sensitive to SAN 
performance. Given the configuration of our experimen- 
tal testbed in Figure 5, the primary difference between 
the dependency paths of various operators is in the
volumes they access: V1 is in the dependency path of O8
and O22, and V2 is in the paths of O4, O11, O14, O16,
O19, O23, and O25.


The correlated leaf operators from Module CO are
O4, O8, and O22. Thus, DIADS will compute anomaly
scores for the performance metrics of both V1 and V2.
Table 2’s second column shows the anomaly scores for 
two representative metrics each from V1 and V2. (Table 
2’s third column is described later in this section.) As ex- 
pected, none of V2’s metrics are identified as correlated 
because V2 has no contention; while those of V1 are. 


Module CR 


Anomaly scores are low in this module because data
properties do not change.


Module SD


The symptoms identified up to this stage are: 
• High anomaly scores for operators dependent on V1.
• High anomaly scores for V1’s performance metrics.
• High anomaly score for only one V2-dependent operator
(out of seven such operators).


These symptoms are strong evidence that V1’s perfor- 
mance is a cause of the query slowdown, and V2’s per- 
formance is not. Thus, even when a symptoms database 
is not available, DIADS correctly narrows down the 
search space an administrator has to consider during
diagnosis. An impact analysis will further point out that
the false positive symptom due to O4 has little impact on
the query slowdown.

However, without a symptoms database or further di- 
agnosis effort from the administrator, the root cause of 
V1’s change of performance is still unknown among pos- 
sible candidates like: (i) change of performance of an ex- 
ternal workload, (ii) a runaway query in the database, or 
(iii) a RAID rebuild. We will now report results from the 
use of a symptoms database that was developed in-house. 
DIADS uses this database as described in Section 4.3 ex- 
cept that instead of reporting numeric confidence scores 
to administrators, DIADS reports confidence as one of 
High (score ≥ 80%), Medium (80% > score ≥ 50%),
or Low (50% > score ≥ 0%). The summary of Module
SD’s output in the current scenario is: 


• All root causes with contention-related symptoms for
V2 have Low confidence (few symptoms are found).

• RAID rebuild gets Low confidence because no RAID
rebuild start or end events are found.

• V1 contention due to changes in data properties gets
Low confidence because symptoms are missing.

• V1 contention due to change in external workload
gets Low confidence because no external workload
was on the outer dependency path of a correlated
operator when performance was satisfactory.

• V1 contention due to change in database workload
gets Medium confidence because of a weak correlation
between the performance of some correlated
operators and the rest of the database workload.

• V1 contention due to the SAN misconfiguration
problem gets High confidence because all specified
symptoms are found including: (i) creation of a new
volume (parametrized with the physical disk information),
and (ii) creation of new masking and zoning
information for the volume.
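
The mapping from numeric confidence scores to the reported levels can be sketched as a small function. Treating a score of exactly 80% as High and exactly 50% as Medium is an assumption of this sketch, as is the function name.

```python
def confidence_level(score):
    """Map a confidence score in [0, 100] percent to High, Medium, or Low."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must lie in [0, 100]")
    if score >= 80.0:   # boundary handling is an assumption
        return "High"
    if score >= 50.0:
        return "Medium"
    return "Low"


levels = [confidence_level(s) for s in (95.0, 60.0, 10.0)]
```

Reporting three coarse levels instead of raw percentages keeps the output easy for administrators to scan while preserving the ranking of root causes.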

The symptoms database had an entry for the actual root 
cause because this problem is common. Hence, DIADS
was able to diagnose the root cause for this scenario.
Note that DIADS had to consider more than 900
events (system-generated as well as user-defined)
generated for the database and SAN during the course of
the satisfactory and unsatisfactory runs for this experiment.


Module IA


Impact analysis done using the inverse dependency anal- 
ysis technique gives an impact score of 99.8% for the 
high-confidence root cause found. This score is high be- 
cause the slowdown is caused entirely by the contention 
in V1. 

In keeping with our experimental methodology, we 
complicated the problem scenario to test DIADS’s ro- 
bustness. Everything was kept the same except that we 
created extra I/O load on Volume V2 in a bursty manner 
such that this extra load had little impact on the query 
beyond the original impact of V1’s contention. Without 
intrusive tracing, it would not be possible to rule out the 
extra load on V2 as a potential cause of the slowdown. 

Interestingly, DIADS’s integrated approach is still able 
to give the right answer. Compared to the previous sce- 
nario, there will now be some extra symptoms due to 
higher anomaly scores for V2’s performance metrics (as 
shown in the third column in Table 2). However, root 
causes with contention-related symptoms for V2 will still 
have Low confidence because most of the leaf operators 
depending on V2 will have low anomaly scores as before. 
Also, impact scores will be low for these causes. 

Unlike DIADS, a SAN-only diagnosis tool may spot 
higher I/O loads in both V1 and V2, and attribute both of 
these as potential root causes. Even worse, the tool may 
give more importance to V2 because most of the data is 
on V2. A database-only tool can pinpoint the slowdown 
in the operators. However, this tool cannot track the root 
cause down to the SAN level because it has no visibility 
into SAN configuration or performance. From our expe- 
rience, database-only tools may give several false posi- 
tives in this context, e.g., suboptimal bufferpool setting 
or a suboptimal choice of execution plan. 


5.3 Scenario 2: Database-layer Problem
Propagating to the SAN-layer 


In this scenario we cause a query slowdown by changing 
the properties of the data, causing extra I/O on Volume 
V2. The change is done by an update statement that mod- 
ifies the value of an attribute in some records of the part 
table. The overall size of all tables, including part, is
unchanged. There are no external causes of contention
on the volumes. 

Modules CO, DA, and CR behave as expected. In particular,
Module CR correctly identifies all the operators
whose record-counts show a correlation with plan
performance: operators O1, O2, O3, and O4 show increased
record-counts, while operators O; and O8 show reduced
record-counts. The root-cause entry for changes in data
properties gets High confidence in Module SD because 
all needed symptoms match. All other root-cause entries
get Low confidence, including contention due to changes
in external workload and database workload because no 
correlations are detected on the outer dependency paths 
of correlated operators (as expected). 

The impact analysis module gives the final confirma- 
tion that the change in data properties is the root cause, 
and rules out the presence of high-impact external causes 
of volume contention. As described in Section 4.4, we 
can use the plan cost model from the database to estimate 
the individual impact of any change in data properties. In 
this case, the impact score for the change in data proper- 
ties is 88.31%. Hence, DIADS could have diagnosed the 
root cause of this problem even if the symptoms database 
was unavailable or incomplete. 


5.4 Scenario 3: Concurrent Database-layer and
SAN-layer Problems


We complicate Scenario 2 by injecting contention on 
Volume V2 due to SAN misconfiguration along with the 
change in data properties. Both these problems individu- 
ally cause contention in V2. The SAN misconfiguration 
is the higher-impact cause in our testbed. This key
scenario represents the occurrence of multiple, possibly
related, events at the database and SAN layers,
complicating the diagnosis process. The expected result from
DIADS is the ability to pinpoint both these events as causes,
and to give the relative impact of each cause on query
performance.

The CO, DA, and CR Modules behave in a fashion 
similar to Scenario 2, and drill down to the contention 
in Volume V2. We considered DIADS’s performance 
in two cases: with and without the symptoms database. 
When the symptoms database is unavailable or incom- 
plete, DIADS cannot distinguish between Scenarios 2 
and 3. However, DIADS’s impact analysis module com- 
putes the impact score for the change in data properties, 
which comes to 0.56%. (This low score is representa- 
tive because the SAN misconfiguration has more than 
10X higher impact on the query performance than the 
change in data properties.) Hence, DIADS’s final answer
in this case is as follows: (i) a change in data properties is 
a high-confidence but low-impact cause of the problem, 
and (ii) there are one or more other causes that impact 
V2 which could not be diagnosed. 

When the symptoms database is present, both the ac- 
tual root causes are given High confidence by Module 
SD because the needed symptoms are seen in both cases. 
Thus, DIADS will pinpoint both the causes. Furthermore, 
impact analysis will confirm that the full impact on the 
query performance can be explained by these two causes. 

A database-only diagnosis tool would have successfully
diagnosed the change in data properties in both Scenarios
2 and 3. However, the tool may have difficulty
distinguishing between these two scenarios or pinpointing
causes at the SAN layer. A SAN-only diagnosis tool
will pinpoint the volume overload. However, it will not
be able to separate out the impacts of the two causes. 
Since the sizes of the tables do not change, we also sus- 
pect that such a tool may even rule out the possibility of 
a change in data properties being a cause. 


5.5 Discussion 


The scenarios described in the experimental evaluation 
were carefully chosen to be simple, but not simplis- 
tic. They are representative of event categories occur- 
ring within the DB and SAN layers as shown in Fig- 
ure 2. We have additionally experimented with different 
events within those categories such as CPU and mem- 
ory contention in the SAN in addition to disk-level satu- 
ration, different types of database misconfiguration, and 
locking-based database problems. Locking-based prob- 
lems are hard to diagnose because they can cause differ- 
ent types of symptoms in the SAN layer, including con- 
tention as well as underutilization. We have also consid- 
ered concurrent occurrence of three or more problems, 
e.g., change in data properties, SAN misconfiguration, 
and locking-based problems. The insights from these ex- 
periments are similar to those seen already, and further 
confirm the utility of an integrated tool. However: 

• High levels of noise in the monitoring data can
reduce DIADS’s effectiveness.

• While DIADS would still be effective when the symptoms
database is incomplete, more manual effort will
be needed to pinpoint actual root causes.

• Incomplete or inaccurate plan cost models reduce the
accuracy of impact analysis.


6 Conclusions and Future Work 


We presented an integrated database and storage diagno- 
sis tool called DIADS. Using a novel combination of 
machine learning techniques with database and storage 
expert domain-knowledge, DIADS accurately identifies 
the root cause(s) of problems in query performance,
irrespective of whether the problem occurs in the database
or the storage layer. This integration enables a more
accurate and efficient diagnosis tool for system
administrators. Through a detailed experimental evaluation, we
also demonstrated the robustness of our approach, including
its ability to deal with multiple concurrent problems as
well as the presence of noisy data.

In the future, we are interested in exploring two
directions of research. First, we are investigating approaches
that further strengthen the analysis done as part of DI- 
ADS modules, e.g., techniques that complement database 
query plan models using planned run-time experiments. 
Second, we aim to generalize our diagnosis techniques to 
support applications other than databases in conjunction 
with enterprise storage. 
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Abstract 


We introduce a novel multi-resource allocator to dynam- 
ically allocate resources for database servers running on 
virtual storage. Multi-resource allocation involves pro- 
portioning the database and storage server caches, and 
the storage bandwidth between applications according to 
overall performance goals. The problem is challenging 
due to the interplay between different resources, e.g., 
changing any cache quota affects the access pattern at 
the cache/disk levels below it in the storage hierarchy. 
We use a combination of on-line modeling and sampling 
to arrive at near-optimal configurations within minutes. 
The key idea is to incorporate access tracking and known 
resource dependencies e.g., due to cache replacement 
policies, into our performance model. 

In our experimental evaluation, we use both micro- 
benchmarks and the industry standard benchmarks TPC- 
W and TPC-C. We show that our multi-resource allocation 
approach improves application performance by up to fac- 
tors of 2.9 and 2.4 compared to state-of-the-art single- 
resource controllers, and their ad-hoc combination, re- 
spectively. 


1 Introduction 


With the emerging trend towards server consolidation in 
large data centers, techniques for dynamic resource al- 
location for performance isolation between applications 
become increasingly important. With server consolida- 
tion, operators multiplex several concurrent applications 
on each physical server of a server farm, connected to 
a shared network attached storage (as in Figure 1). As 
compared to traditional environments, where applica- 
tions run in isolation on over-provisioned resources, the 
benefits of server consolidation are reduced costs of
management, power and cooling. However, multiplexed
applications compete for system resources such as CPU,
memory and disk, especially during load bursts.
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Moreover, in this shared environment, the system is still 
required to meet per-application performance goals. This 
gives rise to a complex resource allocation and control 
problem. 


Currently, resource allocation to applications in state- 
of-the-art platforms occurs through different perfor- 
mance optimization loops, run independently at dif- 
ferent levels of the software stack, such as, at the 
database server, operating system and storage server, in 
the consolidated storage environment shown in Figure 1. 
Each local controller typically optimizes its own local 
goals, e.g., hit-ratio, disk throughput, etc., oblivious to 
application-level goals. This might lead to situations 
where local, per-controller, resource allocation optima 
do not lead to the global optimum; indeed local goals 
may conflict with each other, or with the per-application 
goals [14]. Therefore, the main challenge in these mod- 
ern enterprise environments is designing a strategy which 
adopts a holistic view of system resources; this strat- 
egy should efficiently allocate all resources to applica- 
tions, and enforce per-application quotas in order to meet 
overall optimization goals e.g., overall application per- 
formance or service provider revenue. 


Unfortunately, the general problem of finding the 
globally optimum partitioning of all system resources, 
at all levels to a given set of applications is an NP- 
hard problem. Complicating the problem are inter- 
dependencies between the various resources. For ex- 
ample, let’s assume the two tier system composed of 
database servers and consolidated storage server as in 
Figure 1, and several applications running on each 
database server instance. For any given application, a 
particular cache quota setting in the buffer pool of the 
database system influences the number and type of ac- 
cesses seen at the storage cache for that application. Par- 
titioning the storage cache, in its turn, influences the ac- 
cess pattern seen at the disk. Hence, even deriving an 
off-line solution, assuming a stable set of applications, 
and available hardware e.g., through profiling, trial and 
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Figure 1: Data Center Infrastructure: We show a typical 
data-center architecture using consolidated storage 


error, etc., by the system administrator, is likely to be 
highly inaccurate, time consuming, or both. 


Due to these problems, with a few exceptions [17, 32], 
previous work has eschewed dynamic resource partition- 
ing policies, in favor of investigating mechanisms for 
enforcing performance isolation, under the assumption 
that per-application quotas, deadlines or priorities are 
predefined e.g., manually, for each given resource type. 
Examples of such mechanisms include CPU quota en- 
forcement [2, 16], memory quota allocation based on 
priorities [3], or I/O quota enforcement between work- 
loads [9, 11, 12]. 

Moreover, typically, previous work investigated en- 
forcing a given resource partitioning of a single re- 
source, within a single software tier at a time. In 
our own previous work in the area of dynamic parti- 
tioning, we have investigated either partitioning mem- 
ory, through a simulation-based exhaustive search ap- 
proach [24], or partitioning storage bandwidth, through 
an adaptive feedback-loop approach [23], but not both. 


In this paper, we consider the problem of global 
resource allocation, which involves proportioning the 
database and storage server caches, and the storage band- 
width among applications, according to overall perfor- 
mance goals. To achieve this, we focus on building a 
simple performance model in order to guide the search, 
by providing a good approximation of the overall so- 
lution. The performance model provides a resource-to- 
performance mapping for each application, in all possi- 
ble resource quota configurations. Our key ideas are to 
incorporate readily available information about the appli- 
cation and system into the performance model, and then 
refine the model through limited experimental sampling 
of actual behavior. Specifically, we reuse and extend on- 
line models for workload characterization, i.e., the miss 
ratio curve (MRC) [32], as well as simplifications based 
on common assumptions about cache replacement poli- 
cies. We further derive a disk latency model for a quanta- 
based disk scheduler [27] and we parametrize the model 
with metrics collected from the on-line system, instead 
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of using theoretical value distributions, thus avoiding the 
fundamental source of inaccuracy in classic analytical 
models [10]. 


Finally, we refine the accuracy of the computed per- 
formance model through experimental sampling. We 
use statistical interpolation between computed and ex- 
perimental sample points in order to re-approximate the 
per-application performance models, thus dynamically 
refining the model. We experimentally show that, by us- 
ing this method, convergence towards near-optimal con- 
figurations can be achieved in mere minutes, while an 
exhaustive exploration of the multi-dimensional search 
space, representing all possible partitioning configura- 
tions, would take weeks, or even months. 


We implement our technique using commodity software
and hardware components without any modifications to
interfaces between components, and with minimal
instrumentation. We use the MySQL database engine
running a set of standard benchmarks, i.e., the TPC-W
e-commerce benchmark and the TPC-C transaction
processing benchmark. Our experimental testbed is a
cluster of dual-processor servers connected to commodity
storage hardware.


We show experiments for on-line convergence to a 
global partitioning solution for sharing the database 
buffer pool, storage cache, and disk bandwidth in dif- 
ferent application configurations. We compare our ap- 
proach to two baseline approaches, which optimize ei- 
ther the memory partitioning, or the disk partitioning, as 
well as combinations of these approaches without global 
coordination. We show that for most application con- 
figurations, our computed model effectively prunes most 
of the search space, even without any additional tuning 
through experimental sampling. Our dynamic resource
algorithm performs similarly to an experimental exhaustive
search algorithm, but provides a solution within minutes, 
versus days of running time. At the same time, our global 
resource partitioning solution improves application per- 
formance by up to factors of 2.9 and 2.4 compared to 
state-of-the-art single-resource controllers and their ad- 
hoc combination, respectively. 


The remainder of this paper is structured as follows. 
Section 2 provides a background on existing techniques 
for server consolidation in modern data centers, high- 
lighting the need for a global resource allocation solu- 
tion. We describe our multi-resource partitioning algo- 
rithm in Section 3. Section 4 describes our virtual stor- 
age prototype and sampling methodology in detail. Sec- 
tion 5 presents the algorithms we use for comparison, our 
benchmarks, and our experimental methodology, while 
Section 6 presents the results of our experiments on this 
platform. Section 7 discusses related work and Section 8 
concludes the paper. 
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2 Background and Motivation 


In this section, we present and evaluate the state-of-the- 
art in single resource partitioning and we show why these 
techniques are insufficient in themselves. 


2.1 Single Resource Partitioning 


We describe previous work that allocates either the
storage bandwidth or the cache/memory to several applications.

Storage Bandwidth Partitioning: Several disk 
scheduling policies [11, 12, 27, 29] for enforcing disk 
bandwidth isolation between co-scheduled applications 
have been proposed. We have implemented and com- 
pared the performance isolation guarantees provided by 
the following disk schedulers: (1) Quanta-based schedul- 
ing [27], (2) Start-time Fair Queuing (SFQ) [11], (3) Ear- 
liest Deadline First (EDF), (4) Lottery-based [29] and 
(5) Façade [12]. Our study [18] shows that the
Quanta-based scheduler, where each workload is given a
quantum of time for using the disk in exclusive mode, offers
the best performance isolation level. This is because it 
allows the storage server to exploit the locality in I/O re- 
quests issued by an application during its assigned quan- 
tum, which in turn results in minimizing the effects of 
additional disk seeks due to inter-application interfer- 
ence. However, the existing algorithms discussed above 
assume that the I/O deadlines, or disk bandwidth propor- 
tions are given a priori. In this paper, we study how to 
dynamically determine the bandwidth proportions at run- 
time. Once the bandwidth proportions are determined, 
we use Quanta-based scheduling to enforce the alloca- 
tions, since it provides the strongest isolation guarantees. 
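The enforcement step can be illustrated with a minimal sketch that turns a set of bandwidth proportions into per-workload time quanta within one scheduling round. The function name, the round length and the minimum quantum are hypothetical parameters chosen for illustration; they are not taken from the scheduler in [27].

```python
def quanta_schedule(proportions, round_ms=100.0, min_quantum_ms=10.0):
    """Split one scheduling round into exclusive disk time slices.

    proportions: dict mapping workload name -> bandwidth share (any positive scale).
    round_ms and min_quantum_ms are illustrative assumptions, not measured values.
    """
    total = sum(proportions.values())
    schedule = []
    for workload, share in proportions.items():
        quantum = round_ms * share / total
        # Very small quanta cannot hold even one disk request, so reject them.
        if quantum < min_quantum_ms:
            raise ValueError(f"quantum for {workload} is below the minimum")
        schedule.append((workload, quantum))
    return schedule
```

For example, shares of 0.25 and 0.75 over a 100 ms round yield quanta of 25 ms and 75 ms, during which each workload uses the disk exclusively.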

Memory/Cache Partitioning: Dynamic memory par- 
titioning between applications is typically performed us- 
ing the miss ratio curve (MRC) [32]. The MRC repre- 
sents the page miss ratio versus the memory size, and 
can be computed dynamically through Mattson’s Stack 
Algorithm [13]. The algorithm assigns memory incre- 
ments iteratively to the application with the highest pre- 
dicted miss ratio benefit. MRC-based cache partitioning 
thus dynamically partitions the cache/memory to multi- 
ple applications, in such a way to optimize the aggregate 
miss ratio. 
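As a rough illustration of the two steps described above, the sketch below computes an LRU miss ratio curve from a page trace using stack distances, and then hands out cache pages greedily to the application with the largest predicted drop in miss ratio. It is a simplified reading of the approach, not the implementation of [13] or [32]: the function names are hypothetical, the stack is a plain list (quadratic time), and applications are not weighted by their access rates.

```python
def miss_ratio_curve(trace, max_pages):
    """Return mrc, where mrc[s] is the LRU miss ratio with s cache pages."""
    stack = []                               # LRU stack, most recent page last
    hits_at_depth = [0] * (max_pages + 1)    # hits that need exactly d pages
    for page in trace:
        if page in stack:
            depth = len(stack) - stack.index(page)   # 1 = most recently used
            if depth <= max_pages:
                hits_at_depth[depth] += 1
            stack.remove(page)
        stack.append(page)
    mrc, hits = [], 0
    for size in range(max_pages + 1):
        hits += hits_at_depth[size]
        mrc.append(1.0 - hits / len(trace))
    return mrc


def _at(mrc, size):
    """Miss ratio at a given size; the curve is flat beyond its last point."""
    return mrc[min(size, len(mrc) - 1)]


def greedy_partition(mrcs, total_pages, step=1):
    """Assign pages in increments to the application with the best benefit."""
    alloc = [0] * len(mrcs)
    remaining = total_pages
    while remaining >= step:
        def gain(i):
            return _at(mrcs[i], alloc[i]) - _at(mrcs[i], alloc[i] + step)
        best = max(range(len(mrcs)), key=gain)
        alloc[best] += step
        remaining -= step
    return alloc
```

With a cache-friendly trace such as `[1, 2, 1, 2, 1, 2]` and a trace without reuse, the greedy loop gives nearly all pages to the cache-friendly application, which mirrors the behavior reported for Workload-A and Workload-B in the next subsection.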


2.2 Motivating Experiment 


We present a simple motivating experiment that shows 
the need for multi-resource allocation. To simplify the 
presentation, we consider only accesses to the storage 
server, hence only the storage cache and the storage 
bandwidth resources. We run two synthetic workloads 
concurrently on the storage server: a small workload 
(Workload-A) with 1 outstanding request, and a large 
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Figure 2: Motivating Results: Comparison of aggregate la- 
tency motivates multi-resource controllers. 


workload (Workload-B) with 10 outstanding requests, at 
any given time. Workload-A is cache friendly and achieves 
a cache hit ratio of 50% with a 1GB storage cache. In 
contrast, Workload-B is mostly un-cacheable; it obtains 
only a 5% hit ratio with a 1GB storage cache. 

We run the workloads using several different configu- 
rations, i.e., uncontrolled sharing, partitioning the cache, 
disk or both between workloads. We normalize the la- 
tency of each workload relative to its latency running in 
isolation. Figure 2 presents our results. In all schemes, 
we use the combined application latencies (by simple 
summation) as the global optimization goal. We choose 
this simple metric for fairness of comparison with the 
miss ratio curve algorithm [32], which optimizes the ag- 
gregate miss ratio, hence the aggregate latency, while be- 
ing agnostic to Service Level Objectives (SLOs) in gen- 
eral. 

When running in isolation, Workload-A is able to uti- 
lize the 1 GB cache effectively and this results in an 
average storage access latency of 4.4ms. On the other 
hand, Workload-B does not benefit from the cache, re- 
sulting in an average storage access latency of 85.1ms. 
When the two workloads are run concurrently with un- 
controlled resource sharing, the larger Workload-B domi- 
nates the smaller Workload-A at both cache and disk levels. 
This results in a factor of 6 slowdown for Workload-A and 
a factor of 4 slowdown for Workload-B. This result shows 
that workloads can suffer significant performance degra- 
dation when resource sharing is not controlled. 

Next, we run the workloads using different resource 
partitioning algorithms. First, we partition the storage 
cache using the miss ratio curves of the workloads [32], 
while disk bandwidth sharing is uncontrolled. The MRC 
algorithm determines that the best cache setting is to allo- 
cate the bulk of the storage cache (992 MB) to Workload- 
A and provide a minimum to Workload-B. Cache par- 
titioning thus improves the performance of Workload-A 
significantly from 26.6ms to 19.9ms. Next, we iterate 
through all possible disk partitioning settings to find the 
best disk bandwidth partitioning between the workloads, 
and enforce it using quanta-based scheduling [27], while 
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cache sharing is uncontrolled. By partitioning the disk 
bandwidth, the performance of Workload-A improves to 
13.2ms. In addition, Workload-B improves to 169.7ms. 
While properly partitioning the resource at each level in- 
dependently, as described above, alleviates the interfer- 
ence, neither partitioning results in the optimal configu- 
ration for these two workloads. 

On the other hand, an exhaustive search of both the 
cache and bandwidth settings yields an ideal setting 
where the storage access latency is 9.64ms for Workload-A 
and 171.3ms for Workload-B. In our simple case, the allo- 
cation solution found by the exhaustive search algorithm 
is just a combination of the solutions found by the two 
independent partitioners, for cache and disk. However, 
as we will show, due to the interdependence between re- 
sources, this is not the case when more resources are con- 
sidered. Finally, iterating through all possible configura- 
tions and taking experimental samples for the exhaustive 
search is clearly infeasible for non-trivial combinations 
of resources and workloads. 

These experiments and observations thus motivate us 
to design and implement a coordinated multi-resource 
partitioning algorithm based on an approximate system 
and application model, which we introduce next. 


3 Dynamic Multi-Resource Allocation


In this section, we describe our approach to providing 
effective resource partitioning for database servers run- 
ning on virtual storage. Our main objective is to meet 
an overall performance goal, e.g., minimize the overall 
latency, when running a set of database applications on a 
shared storage server. In order to achieve this, we use the 
following: 


1. A performance model based on minimal statistics 
collection in order to approximate a near-optimal 
allocation of resources to applications according to 
our overall goal, and 


2. An experimental sampling and statistical interpola- 
tion technique that refines the initial model. 


In the following, we first introduce the problem state- 
ment, and an overview of our approach. Then, we in- 
troduce our performance model, and its sampling-based 
fine-tuning in detail. 


3.1 Problem Statement 


We study dynamic resource allocation to multiple appli- 
cations in dynamic content servers with shared storage. 
In the most general case, let’s assume that the system 
contains m resources and is hosting n applications. Our 
goal is to find the optimal configuration for partitioning 
the m resources among the n applications. Let’s denote
with $r_1, r_2, \ldots, r_n$ the data access times of the n
applications hosted by the service provider. For the
purposes of this paper, we assume that the goal of the service
provider is to minimize the sum of all data access
latencies for all applications, i.e., $U = \min \sum_{i=1}^{n} r_i$.

However, our approach does not depend on the partic- 
ular goal we set. For example, alternatively, we can op- 
timize the provider’s revenue expressed as a utility func- 
tion based on the application latencies. Whichever goal 
we set, we assume that our algorithm is aware of that 
goal, and can monitor application performance in order 
to compute the total benefit obtained for all applications, 
in any resource quota configuration. 

Finding a practical solution to this problem is diffi- 
cult, because the optimal resource allocation depends on 
many factors, including the (dynamic) access patterns of 
the applications, and how the inner mechanisms of each 
system component e.g., cache replacement policies, af- 
fect inter-dependencies between system resources. 


3.2 Overview of Approach 


Our technique determines per-application resource quo- 
tas in the database and storage caches, on the fly, in a 
transparent manner, with minimal changes to the DBMS, 
and no changes to existing interfaces between compo- 
nents. Towards this objective, we use an online perfor- 
mance estimation algorithm to dynamically determine 
the mapping between any given resource configuration 
setting and the corresponding application latency. While 
designing and implementing a performance model for 
guiding the resource partitioning search is non-trivial, 
our key insight is to design a model with sufficient ex- 
pressiveness to incorporate i) tracking of dynamic access 
patterns, and ii) sufficiently generic assumptions about 
the inner mechanisms of the system components and the 
system as a whole. 

For this purpose we collect a trace of I/O accesses at 
the DBMS buffer pool level and we use periodic sam- 
pling of the average disk latency for each application in 
a baseline configuration, where the application is given 
all the disk bandwidth. We feed the access trace and 
baseline disk latency for each application into a perfor- 
mance model, which computes the latency estimates for 
that application for all possible resource configurations. 
We thus obtain a set of resource-to-performance map- 
ping functions, i.e., performance models, one for each 
application. Next, we enhance the accuracy of each per- 
formance model through experimental sampling. We use 
statistical regression to re-approximate the performance 
model by interpolating between the precomputed and ex- 
perimentally gathered sample points. 

We then use the corresponding per-application perfor- 
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mance models to determine the near-optimal allocation 
of resources to applications according to our overall goal. 
Specifically, we leverage the derived performance model 
of each application, and use hill climbing [21] to con- 
verge towards a partitioning setting that minimizes the 
combined application latencies. In the following sub- 
section, we describe our model that estimates the per- 
formance of an application using multi-level caches and 
a shared disk. 


3.3 Per-Application Performance Model


We use two key insights about the inner workings of the 
system, as explained next, to derive a close performance 
approximation, while at the same time reducing the com- 
plexity of the model as much as possible. 

Key Assumptions and Ideas: The key assumptions 
we use about the system are i) that the cache replace- 
ment policy used in the cache hierarchy is known to be 
either the standard, uncoordinated LRU, or the coordi- 
nated DEMOTE [31] policy and ii) that the server is a 
closed-loop system i.e., it is interactive and the number 
of users is constant during periods of stable load. Both of 
these assumptions match our target system well, leading 
to a performance model with sufficient accuracy to find 
a near-optimal solution, as we will show in Section 6. 
With the assumptions above, our key idea is to replace 
the search space of a cache hierarchy with the simpler 
search space of a single level of cache, in order to ob- 
tain a close performance estimation, at higher speed, as 
described next. 


3.3.1 Approximate Performance Model 


We approximate the cache hierarchy with the model of a
single-level cache, and we specialize this model for the
two most commonly deployed or proposed cache
replacement policies, i.e., uncoordinated LRU and coordinated
DEMOTE [31]. We also derive a simplified disk model.
Based on our models, assuming that the application is
given quotas, i.e., fractions $\rho_c$, $\rho_s$ and $\rho_d$ of the buffer
pool cache, storage cache and disk bandwidth,
respectively, we estimate the overall data access latency for the
respective quotas through a combination of selective
on-line measurements and computation.

In the following, we first introduce an approximation
of the cache miss ratio of a two-level cache hierarchy,
$M(\rho_c, \rho_s)$, as a function of the cache quotas $\rho_c$ and $\rho_s$,
for the two types of replacement policies we consider.
Then we introduce our disk model that computes the disk
latency as a function of the disk quota, $L_d(\rho_d)$. Finally,
we describe our overall data access latency model.

Modeling the Cache Hierarchy: In a cache hierarchy
using the standard (uncoordinated) LRU replacement
policy at all levels, any cache miss from cache level
$q_i$ will result in bringing the needed block into all lower
levels of the cache hierarchy, before providing the
requested block to cache $i$. It follows that the block is
redundantly cached at all cache levels, which is called
the inclusiveness property [31]. Therefore, if an
application is given a certain cache quota $q_i$ at a level of cache
$i$, any cache quotas $q_j$ given at any lower level of cache
$j$, with $q_j < q_i$, will be mostly wasteful.

In contrast, in a cache hierarchy using coordinated 
DEMOTE [31] cache replacement, when a block is 
fetched from disk, it is not kept in any lower cache lev- 
els. The lower cache levels cache blocks only when the 
block is evicted from a higher cache level. Therefore, 
the application benefits from the combined quotas at all 
levels due to cache exclusiveness. Based on these
observations, we make the following simplifications to
approximate the overall miss ratio of a two-level cache, i.e.,
$M(\rho_c, \rho_s)$, based on a single-level cache model.

In an uncoordinated LRU cache hierarchy, only the 
maximum size quota given at any level of cache matters; 
therefore, we approximate the miss ratio of a two level 
cache, consisting of a buffer pool (with quota $\rho_c$) and
a storage cache (with quota $\rho_s$) by the following formula:


$$M(\rho_c, \rho_s) \approx M_c(\max[\rho_c, \rho_s]) \qquad (1)$$


In a coordinated DEMOTE cache hierarchy, the 
combined cache quotas given to the application at all 
levels of cache has the same effect on the overall miss 
ratio as giving the total quota in a single level of cache. 
Therefore, for DEMOTE cache replacement, we use the 
following formula to approximate the miss ratio of a 
two-level cache: 


$$M(\rho_c, \rho_s) \approx M_c(\rho_c + \rho_s) \qquad (2)$$
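Equations 1 and 2 reduce the two-level hierarchy to a lookup in a single-level miss ratio curve. The sketch below expresses that reduction; the function name and the `policy` strings are hypothetical, and it assumes that both quotas have already been converted to the same size units as the miss ratio curve.

```python
def two_level_miss_ratio(mrc, rho_c, rho_s, policy):
    """Approximate the miss ratio of a buffer pool plus storage cache.

    mrc: callable mapping a single-level cache size to a miss ratio
         (assumed to be the curve measured at the buffer pool).
    rho_c, rho_s: buffer pool and storage cache quotas, in the units of mrc.
    """
    if policy == "lru":
        # Inclusive hierarchy: only the larger quota matters (Equation 1).
        return mrc(max(rho_c, rho_s))
    if policy == "demote":
        # Exclusive hierarchy: the quotas add up (Equation 2).
        return mrc(rho_c + rho_s)
    raise ValueError(f"unknown replacement policy: {policy}")
```

For a hypothetical curve that falls linearly from 1.0 at size 0 to 0.0 at size 1000, quotas of 200 and 500 give a predicted miss ratio of about 0.5 under uncoordinated LRU and about 0.3 under DEMOTE.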


Modeling the Disk Latency: For modeling the disk 
latency, we observe that the typical server system is an 
interactive, closed-loop system. This means that, even 
if incoming load may vary over time, at any given point 
in time, the rate of serviced requests is roughly equal to 
the incoming request rate. According to the interactive 
response time law [10]: 


$$L_d = \frac{N}{X} - z \qquad (3)$$


where $L_d$ is the response time of the storage server,
including both I/O request scheduling and the disk access
latency, $N$ is the number of application threads, $X$ is the
throughput, and $z$ is the think time of each application
thread issuing requests to the disk.
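As a small numeric illustration of the interactive response time law, the following sketch evaluates it directly. The function name and the example values are hypothetical and only serve to show the units involved (threads, requests per second, seconds).

```python
def response_time(n_threads, throughput, think_time=0.0):
    """Interactive response time law: L_d = N / X - z.

    n_threads: number of application threads (N).
    throughput: completed requests per second (X).
    think_time: seconds each thread waits between requests (z).
    """
    return n_threads / throughput - think_time
```

With 10 threads, a throughput of 200 requests per second and a think time of 10 ms, the law gives a response time of roughly 40 ms; with zero think time it gives 50 ms.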
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We then use this formula to derive the average disk 
access latency for each application, when given a cer- 
tain quota of the disk bandwidth. We assume that think 
time per thread is negligible compared to request pro- 
cessing time, i.e., we assume that I/O requests are ar- 
riving relatively frequently, and disk access time is sig- 
nificant. If this is not the case, the I/O component of a 
workload is likely not going to impact overall application 
performance. However, if necessary, more precision can 
be easily afforded e.g., by a context tracking approach, 
which allows the storage server to distinguish requests 
from different application threads [25], hence infer the 
average think time. 

We further observe that the throughput of an applica- 
tion varies proportionally to the fraction of disk band- 
width that the application is given. Since disk satura- 
tion is unlikely in interactive environments with a lim- 
ited number of I/O threads, this is very intuitive, but also 
verified through extensive validation experiments using 
a quanta-based scheduler and a variety of workloads. 

Through a simple derivation, we arrive at the follow- 
ing formula: 


$$L_d(\rho_d) = \frac{L_d(1)}{\rho_d} \qquad (4)$$

where $L_d(1)$ is the baseline disk latency for an
application, when the entire disk bandwidth is allocated to that
application. This formula is intuitive. For example, if the
entire disk was given to the application, i.e., $\rho_d = 1$, then
the storage access latency is equal to the underlying disk
access latency. On the other hand, if the application is
given a small fraction of the disk bandwidth, i.e., $\rho_d \approx 0$,
then the storage access latency is very high (approaches
$\infty$).
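The derivation is not spelled out in the text; one plausible reading, using only the two assumptions stated above (negligible think time, and throughput proportional to the disk quota), is sketched below. The notation $X(\rho_d)$ for the throughput obtained with disk quota $\rho_d$ is introduced here for illustration.

```latex
\begin{align*}
L_d(\rho_d) &= \frac{N}{X(\rho_d)} - z
  && \text{(Equation 3)} \\
 &\approx \frac{N}{X(\rho_d)}
  && \text{(negligible think time, } z \approx 0\text{)} \\
 &= \frac{N}{\rho_d \, X(1)}
  && \text{(throughput proportional to the disk quota)} \\
 &= \frac{L_d(1)}{\rho_d}
  && \text{(since } L_d(1) \approx N / X(1)\text{)}
\end{align*}
```

As a worked instance under these assumptions, a baseline latency of 8 ms with a quarter of the disk bandwidth would give $8 / 0.25 = 32$ ms.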

Finally, the total cache quota allocated to an appli- 
cation influences the arrival rate of I/O requests at the 
disk, hence the baseline disk latency for that applica- 
tion. For example, a larger cache quota may result in 
a smaller disk queue, which in its turn limits opportuni- 
ties for scheduling optimizations to minimize disk seeks. 
Hence, in the absence of disk bandwidth saturation, a 
larger cache quota may result in a higher baseline disk 
latency for the corresponding application. 

Therefore, to compute the baseline disk latency for 
an application given a particular cache configuration, we 
use linear interpolation based on experimental measure- 
ments, taken for a few cache settings, instead of a single 
measurement. 
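A minimal sketch of this interpolation step follows. It assumes the measurements are given as (cache quota, baseline latency) pairs sorted by quota, and it clamps to the nearest measurement outside the sampled range; both the data layout and the clamping are assumptions made for illustration.

```python
from bisect import bisect_left


def baseline_latency(samples, cache_quota):
    """Linearly interpolate L_d(1) for a cache quota.

    samples: list of (cache_quota, measured_baseline_latency), sorted by quota.
    """
    quotas = [q for q, _ in samples]
    if cache_quota <= quotas[0]:
        return samples[0][1]          # clamp below the sampled range
    if cache_quota >= quotas[-1]:
        return samples[-1][1]         # clamp above the sampled range
    hi = bisect_left(quotas, cache_quota)
    (q0, l0), (q1, l1) = samples[hi - 1], samples[hi]
    return l0 + (l1 - l0) * (cache_quota - q0) / (q1 - q0)


def disk_latency(samples, cache_quota, rho_d):
    """Equation 4 applied to the interpolated baseline latency."""
    return baseline_latency(samples, cache_quota) / rho_d
```

For instance, with hypothetical measurements of 8 ms at a quota of 128 and 10 ms at a quota of 512, the interpolated baseline at a quota of 320 is 9 ms, and half of the disk bandwidth then yields 18 ms.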

Computing the Overall Performance Model: As- 
suming that the hit access latency in the buffer pool is 
negligible, the overall latency is determined by the ac- 
cesses that miss in the buffer pool and either i) hit in the 
storage cache or ii) miss in the storage cache, hence ac- 
cess the disk. 
Assuming that the access latency for a hit/miss in the
storage cache is approximately the network/disk latency,
i.e., $L_{net}$/$L_d$, respectively, then the average application
latency is:


$$L_{avg}(\rho_c, \rho_s, \rho_d) = M_c(\rho_c) H_s(\rho_c, \rho_s) L_{net} + M_c(\rho_c) M_s(\rho_c, \rho_s) L_d(\rho_d) \qquad (5)$$


where the miss (and hit) ratio at the storage cache, i.e.,
$M_s(\rho_c, \rho_s)$, is a function of both the quota at the first
level cache ($\rho_c$), and the quota at the second level cache
($\rho_s$), while the miss ratio of the buffer pool, $M_c(\rho_c)$,
is only a function of $\rho_c$. We can further approximate
the fraction of accesses that miss in both levels of cache,
hence reach the disk, i.e., $M_c(\rho_c) M_s(\rho_c, \rho_s)$ from the
formula above, with the fraction of disk accesses given
by the miss ratio of our previously introduced
single-level cache model as:


$$M_c(\rho_c) M_s(\rho_c, \rho_s) = M(\rho_c, \rho_s) \qquad (6)$$


By using the previously derived models for $M(\rho_c, \rho_s)$,
e.g., in the case of uncoordinated LRU (Equation 1), we
obtain:


$$M_s(\rho_c, \rho_s) \approx \frac{M_c(\max[\rho_c, \rho_s])}{M_c(\rho_c)} \qquad (7)$$


Therefore, we can approximate the miss ratio in the 
storage cache, $M_s(\rho_c, \rho_s)$, in terms of the miss ratio
of a single-level cache model. By replacing the respec- 
tive miss/hit ratio of the storage cache in Equation 5, 
we derive the application latency based on our single- 
level cache performance model for either type of cache 
replacement policy. 

Finally, in order to derive a complete resource-to- 
performance model, we perform access trace collection 
and compute the miss ratio curve (MRC) only at the 
buffer pool level. Then, we vary the quota allocations for
the two caches and the disk bandwidth for the applica- 
tion, to all possible combinations in the model. For each 
quota setting, we then compute the corresponding appli- 
cation latencies based on the precomputed buffer pool 
MRC by Equation 5. 
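The computation described above can be sketched as follows. The code evaluates Equation 5 from a single buffer pool miss ratio curve, using Equation 6 together with $H_s = 1 - M_s$ to write $M_c H_s$ as $M_c - M$, and then tabulates the latency for every quota combination. All names are hypothetical, the baseline disk latency is held constant for brevity (the text interpolates it per cache setting), and quotas are assumed to be in the units of the miss ratio curve.

```python
from itertools import product


def avg_latency(mrc, rho_c, rho_s, rho_d, l_net, l_d_base, policy="lru"):
    """Average application latency per Equation 5."""
    m_c = mrc(rho_c)                          # misses in the buffer pool
    if policy == "lru":
        m_total = mrc(max(rho_c, rho_s))      # Equation 1
    else:
        m_total = mrc(rho_c + rho_s)          # Equation 2 (DEMOTE)
    # M_c * H_s = M_c * (1 - M_s) = M_c - M, using Equation 6.
    storage_hits = m_c - m_total
    disk_latency = l_d_base / rho_d           # Equation 4
    return storage_hits * l_net + m_total * disk_latency


def build_model(mrc, cache_quotas, disk_quotas, l_net, l_d_base, policy="lru"):
    """Tabulate latency for all (rho_c, rho_s, rho_d) combinations."""
    return {
        (c, s, d): avg_latency(mrc, c, s, d, l_net, l_d_base, policy)
        for c, s, d in product(cache_quotas, cache_quotas, disk_quotas)
    }
```

The resulting dictionary plays the role of the resource-to-performance mapping for one application; one such table would be built per application.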

Model Adjustment to Dynamic Changes: The 
model needs periodic recalibration, in order to account 
for load variations. Recalibration involves taking new 
samples of the disk latency for each application in a few 
cache configurations, to recompute the baseline disk la- 
tency. A new application trace needs to be collected and 
the new MRC recomputed only if the application pat- 
tern changes. If a new application is co-scheduled on the 
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same infrastructure, we need to sample and compute the 
performance model only for the new application. 


3.4 Sources of Inaccuracy 


In our simple performance model we ignore the effects 
of locking for concurrency control, dirty block flushes for 
the cache model, and imperfect I/O isolation at small disk 
quanta for the disk model. 

Specifically, whenever a dirty block evicted from the 
buffer pool is flushed to disk, the write access goes 
through all lower levels of cache on its way out. Hence, 
the evicted block remains cached in the storage cache, vi- 
olating our assumption of redundancy for uncoordinated 
LRU caches, hence impacting cache miss ratio predic- 
tions. 

Moreover, for low disk quanta, the disk scheduler 
incurs frequent and potentially large disk seeks be- 
tween the data locations of different applications on disk. 
Thereby, our disk latency prediction, as well as the un- 
derlying I/O bandwidth isolation mechanism itself would 
be inaccurate in this case. In particular, the disk quanta 
cannot be less than the maximum duration of a disk read- 
/write, which is that of a block size of 16KB in our case 
(for MySQL). 


3.5 Model Fine-tuning 


In order to fine-tune our performance model at run
time, and hence adaptively correct any inaccuracies, we
use more expensive sampling-based approaches. We
collect experimental samples of application latency in
various resource partitioning configurations, and use
statistical regression, i.e., support
vector machine regression (SVR) [8], to re-approximate 
the resource-to-performance mapping function without 
sampling the search space exhaustively. SVR allows us 
to estimate the performance for configuration settings we 
haven’t actuated, through interpolation between a given 
set of sample points. 

We iteratively collect a set of k randomly selected
sample points. Each sample represents the average ap- 
plication latency measured in a given configuration. We 
replace the respective points in our performance model 
with the new set of experimentally collected samples. 
Using all sample points, consisting of both computed and 
experimentally collected samples, we retrain the regres- 
sion model. We also cross-validate the model by train- 
ing the regression model on a sub-set of all samples and 
comparing with the regression function obtained using 
the remaining samples. If during cross-validation, we 
determine that the regression-based performance model 
is stable [8], then we conclude that we do not need to 
collect any more samples, and we have achieved a highly
accurate performance model for the respective application.
Otherwise, we iterate through the above process
until convergence is achieved. 
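
The refinement loop described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the prototype's implementation: inverse-distance-weighted interpolation is swapped in for SVR so that the example needs only the standard library, the held-out relative error is a hypothetical stand-in for the stability test, and the names `refine`, `measure`, `tolerance`, and `max_rounds` are illustrative.

```python
import random

def idw_predict(samples, point, power=2.0):
    """Inverse-distance-weighted interpolation (a simple stand-in for SVR)."""
    num = den = 0.0
    for cfg, latency in samples.items():
        dist2 = sum((a - b) ** 2 for a, b in zip(cfg, point))
        if dist2 == 0:
            return latency
        weight = 1.0 / dist2 ** (power / 2)
        num += weight * latency
        den += weight
    return num / den

def cross_validation_error(samples, rng, folds=4):
    """Mean relative error when each fold is predicted from the other folds."""
    keys = list(samples)
    rng.shuffle(keys)
    errors = []
    for i in range(folds):
        held_out = keys[i::folds]
        training = {k: samples[k] for k in keys if k not in held_out}
        for k in held_out:
            predicted = idw_predict(training, k)
            errors.append(abs(predicted - samples[k]) / samples[k])
    return sum(errors) / len(errors)

def refine(model, measure, configs, k=8, tolerance=0.05, max_rounds=20, seed=0):
    """Replace computed points with measured ones, k at a time, until stable."""
    rng = random.Random(seed)
    samples = dict(model)            # start from the computed model points
    unmeasured = list(configs)
    rng.shuffle(unmeasured)          # random selection of sample points
    for _ in range(max_rounds):
        batch, unmeasured = unmeasured[:k], unmeasured[k:]
        for cfg in batch:
            samples[cfg] = measure(cfg)   # measured value overrides the model
        if cross_validation_error(samples, rng) < tolerance or not unmeasured:
            break
    return samples
```

Here `model` maps each configuration tuple to a computed latency and `measure` is a callable that returns the measured latency for one configuration; both are assumptions made for the sake of the example.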


3.6 Finding the Optimal Configuration 


Based on the per-application performance models de- 
rived as above, we find the resource partitioning set- 
ting which gives the optimum i.e., lowest combined la- 
tency in our case, by using hill climbing with random- 
restarts [21]. The hill climbing algorithm is an iterative 
search algorithm that moves towards the direction of in- 
creasing combined utility value for all valid configura- 
tions at each iteration. To avoid reaching a local opti- 
mum, we conduct several searches from several points 
chosen randomly until each search reaches an optimum. 
We use the best result obtained from all searches. 
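
As an illustration, hill climbing with random restarts over a discrete configuration grid can be sketched as below. The grid shape, the neighbor function, and `toy_latency` are hypothetical stand-ins for the per-application performance models; minimizing combined latency plays the role of maximizing combined utility.

```python
import random

def hill_climb(latency, start, neighbors):
    """Move to the best neighbor until no neighbor lowers the combined latency."""
    current = start
    while True:
        best = min(neighbors(current), key=latency, default=current)
        if latency(best) >= latency(current):
            return current           # local optimum reached
        current = best

def random_restart_search(latency, configs, neighbors, restarts=10, seed=0):
    """Run several climbs from random starting points and keep the best result."""
    rng = random.Random(seed)
    starts = [rng.choice(configs) for _ in range(restarts)]
    return min((hill_climb(latency, s, neighbors) for s in starts), key=latency)

def grid_neighbors(cfg, cache_chunks=16, disk_slices=5):
    """Configurations that differ by one cache chunk or one disk slice."""
    c, d = cfg
    steps = [(c - 1, d), (c + 1, d), (c, d - 1), (c, d + 1)]
    return [(x, y) for x, y in steps
            if 1 <= x < cache_chunks and 1 <= y < disk_slices]

def toy_latency(cfg, cache_chunks=16, disk_slices=5):
    """Hypothetical combined latency of two applications sharing both resources."""
    c, d = cfg                       # share of application A; B gets the rest
    return 100 / c + 200 / (cache_chunks - c) + 50 / d + 80 / (disk_slices - d)

configs = [(c, d) for c in range(1, 16) for d in range(1, 5)]
best = random_restart_search(toy_latency, configs, grid_neighbors)
```

With a real, non-convex performance surface a single climb may stop at a local optimum, which is why the best result over all restarts is kept.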


4 Prototype Implementation 


Our infrastructure (Akash¹) consists of a virtual storage
system prototype designed to run on commodity hardware.
It supports data accesses to multiple virtual volumes
for any storage client, such as database servers
and file systems. It uses the Network Block Device
(NBD) driver packaged with Linux to read and write logical
blocks from the virtual storage system, as shown
in Figure 3. NBD is a standard storage access protocol
similar to iSCSI, supported by Linux. It provides a
method to communicate with a storage server over the
network. The client machine (shown on the left) mounts
the virtual volume as an NBD device (e.g., /dev/nbd1),
which is used by MySQL as a raw disk partition (e.g.,
/dev/raw/raw1). We modified existing client and
server NBD protocol processing modules for the storage
client and server, respectively, in order to interpose
our storage cache and disk controller modules on the I/O
communication path, as shown in the figure.

In addition, we provide interfaces for creating/destroy- 
ing new virtual volumes and setting resource quanta per 
virtual volume. Our infrastructure supports a resource 
controller in charge of partitioning multiple levels of 
storage cache hierarchy and the storage bandwidth. The 
controller determines per-application resource quotas on 
the fly, based on our performance model introduced in 
Section 3, in a transparent manner, with minimal changes 
to the DBMS, i.e., to collect access traces at the level of
the buffer pool and to monitor performance. In addition, 
we modify the MySQL/InnoDB buffer pool to support 
dynamic partitioning and resizing of its buffer pool, since 
it does not currently provide these features. 


¹ Akash is a Sanskrit word meaning “sky” or “space”.

Figure 3: Virtual Storage Architecture: We show one client
connected to a storage server using NBD. 


4.1 Sampling Methodology 


For each hosted application, and given configuration, in 
order to collect a sample point, we record the average 
and standard deviation of the data access latency, for the 
corresponding application in that configuration. For each 
sample point where we change the cache configuration, 
we wait for cache warm-up, until the application miss 
ratio is stable (which takes approximately 15 minutes on 
average in our experiments). Once the cache is stable, we 
monitor and record the application latency several times 
in order to reduce the noise in measurement. Once mea- 
sured, sample points for an application can also be stored 
as an application surface on disk and later retrieved. 
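
The sampling procedure above can be sketched as follows. The stability window, the tolerance, and the callables `read_miss_ratio` and `read_latency` are hypothetical; the sketch only illustrates waiting for a stable miss ratio and then recording the average and standard deviation of several latency readings.

```python
import statistics

def is_stable(miss_ratios, window=5, tolerance=0.01):
    """True once the last `window` miss-ratio readings vary by at most `tolerance`."""
    if len(miss_ratios) < window:
        return False
    recent = miss_ratios[-window:]
    return max(recent) - min(recent) <= tolerance

def collect_sample(read_miss_ratio, read_latency, repeats=10, max_warmup_reads=1000):
    """Wait for cache warm-up, then return (mean, stdev) of the measured latency."""
    history = []
    for _ in range(max_warmup_reads):
        history.append(read_miss_ratio())
        if is_stable(history):
            break
    else:
        raise RuntimeError("cache did not warm up")
    # Several readings reduce the noise in the measurement.
    latencies = [read_latency() for _ in range(repeats)]
    return statistics.mean(latencies), statistics.stdev(latencies)
```

In a real deployment the two callables would poll the cache and the I/O path at fixed intervals; here they are left abstract.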


4.1.1 Efficient Sampling for Exhaustive Search 


For the purpose of exhaustive sampling i.e., for com- 
paring our model to measured optimum configurations 
(see Section 6.3.3), the controller iteratively sets the de- 
sired resource quotas and measures the application la- 
tency during each sampling period. We use the follow- 
ing rules of thumb in order to speed up the exhaustive 
sampling process: 

Cost-aware Iteration: We sort resources in descending
order of re-partitioning cost. Cache repartitioning
has a higher sampling cost than disk repartitioning
because of the need to wait for cache warm-up in
each new configuration. Therefore, we go through all
cache partitioning possibilities as the outermost loop of 
our iterative exhaustive search; for each cache setting we 
go through all possible disk bandwidth settings in an in- 
ner loop, thus making fewer changes to stateful resources 
overall. 

Order Reversal: The time to acquire a sample can be 
further reduced by iterating from larger cache quotas to 
smaller cache quotas i.e., from 1024MB to 32MB in a 
1024MB cache. In this case, the cache warm-up of the 
largest cache quota will be amortized over the sampling
for all cache quotas for the application. 
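
The two rules of thumb can be illustrated with the following sketch, which enumerates sampling points for one application and one cache level. The function name, the 32MB step, and the disk share values are assumptions chosen for illustration.

```python
def sampling_order(cache_mb=1024, chunk_mb=32, disk_shares=(20, 40, 60, 80)):
    """Yield (cache_quota_mb, disk_share) pairs in a cheap-to-sample order."""
    # Outer loop: the stateful, expensive resource (cache), from large to small,
    # so that a smaller quota starts from an already warm cache.
    for cache_quota in range(cache_mb, 0, -chunk_mb):
        # Inner loop: the cheap resource (disk bandwidth).
        for disk_share in disk_shares:
            yield cache_quota, disk_share

order = list(sampling_order())
# Number of times the cache quota changes between consecutive samples.
cache_changes = sum(1 for a, b in zip(order, order[1:]) if a[0] != b[0])
```

With the loops nested the other way round, the cache quota would change on almost every step, and each change would incur a fresh warm-up wait.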


5 Evaluation 


In this section, we describe several resource partitioning 
algorithms we use in our evaluation. In addition, we de- 
scribe the benchmarks and methodology we use. 


5.1 Algorithms used in Experiments 


We compare our GLOBAL+ resource partitioning scheme,
where we combine performance estimation and experimental
sampling, with the following resource partitioning
schemes.


1. GLOBAL: Our resource allocation scheme where
we use only the performance model. As opposed to
the GLOBAL+ scheme, we do not add any runtime
performance samples.


2. MRC: Uses MRC to perform cache partitioning in- 
dependently at the buffer pool and the storage cache, 
based on access traces seen at that level. The disk 
bandwidth is equally divided among all applica- 
tions. 


3. DISK: Assigns equal portions of the cache to all ap- 
plications at each level and explores all the possible 
configurations at the disk level. 


4. MRC+DISK: Uses the cache configurations produced 
by the MRC scheme and then explores all the pos- 
sible configurations for partitioning the disk band- 
width. 


5. IDEAL*: Finds the configuration with best overall 
latency by exhaustive search through all possible 
cache and disk partitioning configurations. We al- 
locate the caches in 64MB chunks, and the disk in 
20ms quanta slices, yielding a total of 16 × 16 × 5 =
1280 samples measured for each application. A 
more accurate solution can be obtained at finer grain 
increments, e.g., 32MB chunks, but the experiments 
are estimated to take months in this case. 
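
The size of the exhaustive search space can be checked with a short calculation. The sketch assumes that the factor 16 comes from dividing each 1024MB cache into 64MB chunks and takes the five disk settings directly from the product above; `sample_count` is a hypothetical helper.

```python
def sample_count(cache_mb=1024, chunk_mb=64, disk_settings=5):
    """Samples per application: buffer pool settings x storage cache settings x disk settings."""
    per_level = cache_mb // chunk_mb
    return per_level * per_level * disk_settings

coarse = sample_count(chunk_mb=64)   # 64MB chunks
fine = sample_count(chunk_mb=32)     # 32MB chunks
```

Halving the chunk size quadruples the number of cache combinations, which is consistent with the much longer experiment time expected at the finer granularity.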


5.2 Platform and Methodology 


Our evaluation infrastructure consists of three machines: 
(1) a storage server running Akash to provide virtual 
disks, (2) a database server running MySQL, and (3) a 
load generator for the benchmarks. 

We use three workloads: a simple micro-benchmark, 
called UNIFORM, and two industry-standard benchmarks, 
TPC-W and TPC-C. In our experiments, the benchmarks 
share both the database and storage server machines, using
the (default) LRU replacement, and containing 1GB
of memory each. Cache quotas are allocated in 64MB 
increments, with a minimum of 64MB. Disk quotas are 
allocated as 20ms disk quanta slices. 

We run our Web based applications (TPC-W) on 
a dynamic content infrastructure consisting of the 
Apache web server, the PHP application server and the 
MySQL/InnoDB (version 5.0.24) database engine. We 
run the Apache Web server and MySQL on a Dell PowerEdge
SC 1450 with dual Intel Xeon processors running
at 3.0 GHz with 2GB of memory. MySQL connects to
the raw device hosted by the NBD server. We run the
NBD server on a Dell PowerEdge PE1950 with 8 Intel
Xeon processors running at 2.8 GHz with 3GB of memory.
To maximize I/O bandwidth, we use RAID 0 on 15
10K RPM 250GB hard disks.

We configure Akash to use 16KB block size to match 
the MySQL/InnoDB block size. Each workload instance 
uses a different virtual volume: a 32GB virtual disk for 
TPC-C, a 64GB virtual disk for TPC-W, and a 64GB disk 
for UNIFORM. In addition, we use the Linux O_DIRECT 
mode to bypass any OS-level buffer caching and the 
noop I/O scheduler. 


5.2.1 Benchmarks 


UNIFORM: We generate the UNIFORM workload by accessing
data in a uniformly random order. The behavior
is controlled by two parameters: the size of the data set 
(d) and the memory working set size (w). We run the 
workload with d=64GB and w=/GB. 

TPC-W: The TPC-W benchmark from the Transaction
Processing Performance Council [1] is a transactional web benchmark
designed for evaluating e-commerce systems. Several 
web interactions are used to simulate the activity of a re- 
tail store. The database size is determined by the number 
of items in the inventory and the size of the customer 
population. We use 100K items and 2.8 million cus- 
tomers which results in a database of about 4 GB. We 
use the shopping workload that consists of 20% writes. 
To fully stress our architecture, we run 10 TPC-W in- 
stances in parallel creating a database of 40 GB. 

TPC-C: The TPC-C benchmark [20] simulates a wholesale
parts supplier that operates using a number of warehouses
and sales districts. Each warehouse has 10 sales
districts and each district serves 3000 customers. The
workload involves transactions from a number of termi- 
nal operators centered around an order entry environ- 
ment. There are 5 main transactions for: (1) entering 
orders (New Order), (2) delivering orders (Delivery), (3) 
recording payments (Payment), (4) checking the status of 
the orders (Order Status), and (5) monitoring the level of 
stock at the warehouses (Stock Level). Of the 5 transactions,
only Stock Level is read only, but it constitutes only
4% of the workload mix. We scale TPC-C by using 128
warehouses, which gives a database footprint of 32GB.


Figure 4: Miss Ratio Curves: At the buffer pool for our
workloads.


6 Results 


We evaluate our approach using the TPC-C and TPC-W in- 
dustry standard benchmarks. We also use the synthetic 
UNIFORM workload. We first characterize our work- 
loads by preliminary experiments showing their com- 
puted MRC at the buffer pool level, then report and com- 
pare the average data access latency, measured at the first 
level cache, for each application, when using different re- 
source partitioning schemes. 


6.1 Miss Ratio Curves 


Figure 4 shows the miss ratio curves at the first level 
cache (buffer pool) for all applications. We can see that 
TPC-W and TPC-C are more cacheable than UNIFORM. 
UNIFORM has comparatively higher miss ratios, and it 
benefits greatly from larger cache allocations. On the 
other hand, TPC-W and TPC-C are less affected by cache 
allocations past 128MB. 


6.2 Overall Performance


We run either identical workload instances, or different 
workload instances, concurrently, on our infrastructure, 
and compare the performance of our partitioning algo- 
rithms. Figures 5-8 show the latency of each applica- 
tion after each partitioner produces a solution. We also 
show the respective partitioning solutions, and the time 
in which they were achieved by each resource partitioner 
(we include the time to collect a reliable access trace in 
the timing for our algorithms, although this is overlapped 
with normal application execution). 

We notice the following overall trends in our results.
Our GLOBAL+ partitioner arrives at the same partitioning
solution as, and provides identical performance to,
IDEAL*, at a fraction of the cost. The performance of
the GLOBAL partitioner, based only on the computational
model, is relatively close to the ideal performance as
well. GLOBAL registers significant improvements with
experimental sampling only for workload combinations
that include TPC-C, an application with a substantial
fraction of writes. Moreover, with one exception, our
GLOBAL partitioner is both faster and generates better
partitioning settings than the combination of single-resource
controllers, i.e., the MRC+DISK partitioner.


Figure 5: Identical Instances: Comparison for UNIFORM.

The single resource partitioning schemes, i.e., MRC 
and DISK, are limited in their ability to control perfor- 
mance. For example, DISK is ineffective for cache-bound 
workloads (see Figures 5, 6, 7). A more subtle point is 
that in some cases, the poor choices made by the MRC 
scheme can be corrected by providing more disk band- 
width to disadvantaged applications in the MRC+DISK 
scheme. 

We discuss our performance results in detail next and 
we examine the accuracy of our model and its refine- 
ments in Section 6.3. 


6.2.1 Identical Workload Instances 


First, we look at cases where we run two instances of the 
same application. Figure 5 presents our results for the 
UNIFORM/UNIFORM configuration. The results for TPC- 
C/TPC-C and TPC-W/TPC-W are similar. 

In these experiments, the miss ratio curves of 
the two applications are identical. Thus, the 
MRC/MRC+DISK/DISK schemes choose to partition the 
cache levels equally at both the client and storage caches. 
With this setting, due to cache inclusiveness, the second 
level cache, i.e., the storage cache, provides little bene- 
fit, resulting in poor performance for these partitioners. 
For the results shown in Figure 5, our GLOBAL scheme
finds a resource partitioning setting of 64MB/960MB
and 960MB/64MB between the two instances of UNIFORM,
at the buffer pool and storage caches respectively.
This setting provides a much better cache usage scenario 
than equal partitioning of the two caches. 
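
The effect of cache inclusiveness can be reproduced with a small simulation of two uncoordinated LRU caches under uniformly random accesses. This is an illustrative sketch only: the cache sizes are counted in abstract blocks rather than MB, and the data set size, access count, and function names are assumptions.

```python
import random
from collections import OrderedDict

class LRU:
    """Fixed-capacity LRU cache that tracks block identifiers only."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, block):
        """Return True on a hit; on a miss, insert the block and evict the LRU one."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        self.blocks[block] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)
        return False

def combined_hit_ratio(pool_size, storage_size, data_blocks=4096,
                       accesses=100_000, seed=0):
    """Fraction of accesses served by either the buffer pool or the storage cache."""
    rng = random.Random(seed)
    pool, storage = LRU(pool_size), LRU(storage_size)
    hits = 0
    for _ in range(accesses):
        block = rng.randrange(data_blocks)
        # The storage cache is consulted only on a buffer pool miss.
        if pool.access(block) or storage.access(block):
            hits += 1
    return hits / accesses
```

Comparing `combined_hit_ratio(512, 512)` with `combined_hit_ratio(64, 960)` and `combined_hit_ratio(960, 64)` should show that the equal split wastes most of the second-level cache on blocks that are already held in the first level, whereas the two asymmetric splits behave roughly like a single cache of the larger size.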


Scheme      B.Pool          S.Cache       Quanta      Time
            TPC-W   UNIF    W      U      W     U     (mins)
GLOBAL      64      960     896    128    40    60    16
GLOBAL+     64      960     896    128    40    60    59
MRC         128     896     384    640    50    50    32
DISK        512     512     512    512    40    60    5
MRC+DISK    128     896     384    640    40    60    37
IDEAL*      64      960     896    128    40    60    3660

(b) Allocation

Figure 6: TPC-W/UNIFORM: Comparison for TPC-W (W)
and UNIFORM (U) run concurrently.


Scheme      B.Pool          S.Cache       Quanta      Time
            TPC-C   UNIF    C      U      C     U     (mins)
GLOBAL      64      960     896    128    40    60    16
GLOBAL+     64      960     512    512    40    60    760
MRC         128     896     512    512    50    50    32
DISK        512     512     512    512    40    60    5
MRC+DISK    128     896     512    512    40    60    37
IDEAL*      64      960     512    512    40    60    3660

(b) Allocation

Figure 7: TPC-C/UNIFORM: Comparison for TPC-C (C) and
UNIFORM (U) run concurrently.


Overall, GLOBAL provides the same partitioning solu- 
tion as IDEAL* and obtains a factor of 2.4 speedup over 
MRC+DISK. For the experiments with two instances of 
TPC-W and TPC-C, GLOBAL obtains a factor of 1.05 and 
1.5 speedup, respectively, over MRC+DISK. 


6.2.2 Different Workload Instances 


Figures 6-8 present our results for different concurrent
workloads. The results show that the allocations chosen
by the GLOBAL partitioner are non-trivial, and good
performance is obtained only when the settings of all resources
are considered.


Scheme      B.Pool          S.Cache       Quanta      Time
            TPC-W   TPC-C   W      C      W     C     (mins)
GLOBAL      192     960     896    128    60    40    16
GLOBAL+     256     768     768    256    60    40    760
MRC         384     640     384    640    50    50    32
DISK        512     512     512    512    50    50    5
MRC+DISK    384     640     384    640    60    40    37
IDEAL*      256     768     768    256    60    40    3660

Figure 8: TPC-W/TPC-C: Comparison for TPC-W (W) and
TPC-C (C) run concurrently.


First, we examine the TPC-W/UNIFORM configuration, 
shown in Figure 6. The UNIFORM workload has both 
larger cache and disk requirements than TPC-W. Since 
the miss ratio curve of UNIFORM is steeper than that
of TPC-W, once the first 128MB is allocated to TPC-W,
the MRC partitioner allocates the rest of the buffer pool
(896MB) to UNIFORM. However, UNIFORM is penalized
by the 50/50 disk bandwidth partitioning in this case.
On the other hand, the DISK partitioner selects a 60/40
disk bandwidth allocation in favor of UNIFORM. However,
dividing the caches 50/50 results in poor performance for
this partitioner. The MRC+DISK scheme corrects the disk
quanta allocation of the MRC scheme. However, due to 
the underlying uncoordinated LRU replacement policy, 
it fails to obtain a synergistic configuration for the two 
caches. Therefore, GLOBAL performs a factor of 1.12 
better than MRC+DISK, by obtaining a better cache con- 
figuration overall, in addition to allocating the disk band- 
width in favor of UNIFORM. GLOBAL performs a factor of 
1.29 better than MRC, and a factor of 2.61 better than 
DISK. 
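
The behavior of the MRC partitioner in this scenario can be illustrated with a greedy allocation sketch. It assumes, as a simplification, that each 64MB chunk goes to the application whose miss ratio drops the most from one additional chunk; the two curves are hypothetical shapes chosen to mimic a workload that flattens after 128MB and one that keeps improving linearly, not measured data.

```python
def greedy_partition(mrcs, total_chunks, min_chunks=1):
    """Assign chunks one at a time to the application with the largest marginal gain.

    mrcs maps an application name to a list in which entry i is the miss
    ratio of that application when it holds i chunks.
    """
    alloc = {app: min_chunks for app in mrcs}

    def gain(app):
        curve, have = mrcs[app], alloc[app]
        return curve[have] - curve[have + 1]

    for _ in range(total_chunks - min_chunks * len(mrcs)):
        best = max(alloc, key=gain)
        alloc[best] += 1
    return alloc

# Hypothetical miss ratio curves, indexed by number of 64MB chunks (0..16).
tpcw_like = [1.0, 0.30] + [0.10] * 15              # flat beyond two chunks
uniform_like = [1.0 - i / 1024 for i in range(17)]  # steady linear improvement
allocation = greedy_partition({"TPC-W": tpcw_like, "UNIFORM": uniform_like}, 16)
```

With these curves the sketch gives two chunks (128MB) to the TPC-W-like workload and the remaining 14 chunks (896MB) to the UNIFORM-like one. Because each cache level is partitioned from the trace seen at that level alone, such a scheme cannot account for redundancy between the two levels.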

Next, we look at the TPC-C/UNIFORM configuration, 
shown in Figure 7. The results are similar to the TPC- 
W/UNIFORM configuration, with one exception. The 
model for our GLOBAL partitioner mispredicts the cache 
behavior at the storage cache. The assumption about 
block redundancy between the buffer pool and storage 
cache does not hold for TPC-C, an application with a substantial
fraction of writes. Hence, allocating more storage
cache to TPC-C, as in the solutions of all other partitioners,
is beneficial, resulting in increased hit rates in
this cache. The DISK and MRC partitioners under-perform
for the same reason as before, i.e., because allocating either
cache or disk resources 50/50 penalizes UNIFORM.
Hence, GLOBAL+ performs a factor of 1.14 and 2.29
better than MRC and DISK, respectively, and similar to
MRC+DISK.

Finally, we study the TPC-W/TPC-C configuration, 
shown in Figure 8. As the miss ratio curve for TPC-C 
is slightly steeper than TPC-W, the MRC partitioner al- 
locates a larger fraction of the buffer pool (640MB) to 
TPC-C. Moreover, the miss ratio curves of the two ap- 
plications are similar to each other at the storage cache 
level. Therefore, the same greedy MRC cache algorithm 
allocates a larger fraction of the storage cache (640MB) 
to TPC-C as well. This results in over-allocation of total
cache space to TPC-C, severely penalizing TPC-W,
when compared to the cache configuration and performance
achieved by IDEAL* (and our GLOBAL+). Allocating
a larger disk fraction to TPC-W in MRC+DISK compensates
for the poor cache partitioning of MRC alone.
The GLOBAL+ scheme allocates a larger proportion of the
storage cache than GLOBAL to TPC-C, correcting the initial
mis-prediction, while still balancing the allocation at
the two caches to avoid redundancy, hence providing
better overall performance. As a result, GLOBAL+ is
a factor of 2.89 better than MRC, a factor of 1.72 better
than DISK, and a factor of 1.51 better than MRC+DISK.


6.3 Performance Model Accuracy 


In this section, we evaluate the accuracy of our cache 
and disk approximations in our performance model. In 
addition, we present results for online refinement of our 
model through experimental sampling. 


6.3.1 Two-level Cache Approximation 


We evaluate the accuracy of the two-level cache miss ra- 
tio prediction. Figure 9 presents our results for TPC-W 
and TPC-C. We first provide a detailed analysis for TPC-W,
for three buffer pool sizes (64MB, 256MB, 512MB)
and a range of storage cache sizes, where we plot two
cache miss ratio curves: experimentally measured (solid
lines) and predicted by the model (dashed lines). As we can
see, the predicted and measured miss ratio curves are
close together; hence, our cache approximation is accurate
in calculating the miss ratio at the storage cache. The
areas of inaccuracy, where the relative error is greater
than 2%, occur when the storage cache is equal to the
buffer pool size, i.e., 512MB. In this case, the replacement
policy is affected by concurrency control, i.e., through the
fix/unfix of buffer blocks, and by some other thread
optimizations to mitigate cache pollution for table scans.
Figure 9: Two-level Cache Approximation: Errors for cache configurations with TPC-W and TPC-C. Figure 9a shows the
Measured and Predicted miss ratio curves for buffer pool sizes (64MB,256MB,512MB). Figures 9b-9c show error heatmaps where 
light/dark colors represent low/high error, respectively. The magnitude of the error is shown in the legend on the right. 


We further present the error of our model as a more 
general heat-map, where low errors (0-20%) are shown 
in light colors, whereas higher errors are shown in darker 
colors, for a wide range of cache configurations, for both 
our benchmarks. For both benchmarks, the area of any 
significant inaccuracy is where the two cache sizes are 
equal, especially for large cache sizes. However, these 
very configurations are unlikely to be used as an allo- 
cation solution, because they correspond to a high level 
of redundancy for uncoordinated two-level LRU caches. 
Moreover, for high cache sizes, the miss ratio of most 
applications is low, hence the error is less relevant. The 
errors are higher for TPC-C due to its large fraction of 
writes, hence unpredictable hits in the storage cache for 
dirty blocks previously evicted from the buffer pool. For 
both benchmarks, the error falls below 2% when the stor- 
age cache is at least a factor of 2 larger than the buffer 
pool size. 


6.3.2 Quanta-based Scheduler Approximation 


We evaluate the accuracy of our disk latency approxima- 
tion, when using a quanta-based scheduler (Equation 4). 
We plot both the predicted and the measured disk latency, 
for each application, by varying the storage bandwidth 
quanta. Figure 10a and Figure 10b present our results 
for TPC-C and TPC-W, respectively. In each graph, we 
plot and compare two lines: measured (solid lines) and 
predicted (dashed lines), for different cache sizes (given 
mostly at the buffer pool). 

Overall, the predicted disk latency significantly deviates
from the measured latency only for small quanta values.
Moreover, slightly higher errors can be observed for
higher cache sizes. In both of these cases, the explanation
is the higher variability of the average disk latency over
time when i) the underlying disk bandwidth isolation is
less effective due to frequent switching between workloads
and ii) disk scheduling optimizations are less effective
and reliable due to fewer requests in the scheduler
queue. Moreover, our model ignores the “think time” between
successive requests for the same workload. On the
other hand, we can see that our model successfully captures
the latency deviations due to changes in the cache
size for TPC-C.


[Figure 10 panels: (a) TPC-C, (b) TPC-W; x-axis: Disk Quota; series: measured (M) and predicted (P) for cache sizes of 128, 512, and 960 in (a) and 128 and 960 in (b)]

Figure 10: Accuracy of Quanta Scheduler Approximation:
We plot the Predicted and Measured disk latency by varying the
disk scheduler quota, in different cache configurations, from
128MB cache to 960MB cache.


6.3.3 Model Refinement with Runtime Sampling 


As shown, our model is inaccurate in very localized areas
of the total search space, where inaccuracies may not
matter, or can be improved by experimental sampling.
Figure 11 shows the accuracy improvement through online
performance sampling. On the x-axis we show the
number of samples added to our performance model experimentally,
and on the y-axis we show the error between
the predicted and the actual latencies. For both
TPC-W and TPC-C, adding samples by online sampling
significantly reduces the error rate: from 72% to 16% for
TPC-C and from 38% to 10% for TPC-W.


Figure 11: Online Sampling: We refine our model accuracy
at runtime with experimental sampling.


7 Related Work 


Previous related work has focused on dynamic alloca- 
tion and/or controlling either memory allocation or disk 
bandwidth partitioning among competing workloads. 
Dynamic Memory Partitioning: Dynamic memory 
allocation algorithms have been studied in the VMware
ESX server [28]. The algorithm estimates the working- 
set sizes of each VM and periodically adjusts each 
VM’s memory allocation such that performance goals 
are met. Adaptive cache management based on applica- 
tion patterns or query classes has been extensively stud- 
ied in database systems. For example, the DBMIN al- 
gorithm [7] uses the knowledge of the various patterns 
of queries to allocate buffer pool memory efficiently. In 
addition, many cache replacement algorithms have been 
studied e.g., LRU-k [15], in the presence of concurrent 
workloads. LRU-k prevents useful buffer pages from be- 
ing evicted due to sequential scans running concurrently. 
Brown et al. [3] study schemes to ensure per-class response
time goals in a system executing queries of multiple
classes by sizing the different memory regions.
More recently, IBM DB2 added the self-tuning memory
manager (STMM) to size different memory regions [26].
Disk Bandwidth Partitioning: Dynamic allocation 
of the disk bandwidth has been studied to provide QoS at 
the storage server. Just like in our prototype, SLEDS [5],
Facade [12], SFQ [11], and Argon [27] place a scheduling
tier above the existing disk scheduler in order to control
the I/Os issued to the underlying disk. However,
these techniques assume that the proportions are known, e.g.,
set manually. More recent techniques, e.g.,
Cello [22], YFQ [4] and Fahrrad [19], build QoS-aware
disk schedulers, which make low-level scheduling decisions
that strive to minimize seek times as well as maintain
quality of service.

Multi-resource Partitioning: Multi-resource parti- 
tioning is an emerging area of research where multiple 
resources are partitioned to provide isolation and QoS 
for several competing applications. Wachs et al. [27] 
show the benefit of considering both cache allocation 
and disk bandwidth allocation to improve the perfor- 
mance in shared storage servers. However, the resource 
allocation is done after modelling applications through 
extensive profiling. Chanda et al. [6] implement pri- 
ority scheduling at the web and database server lev- 
els. Wang et al. [30] extend the SFQ [11] algorithm to 
several storage servers. Padala et al. [17] study meth- 
ods to allocate memory and CPU to several virtual ma- 
chines located within the same physical server. How- 
ever, these papers focus on either i) dynamic partitioning 
and/or quota enforcement of a single resource on mul- 
tiple machines [6, 30] or ii) allocation of multiple re- 
sources within a single machine [17, 27]. In our study, 
we have shown that global resource partitioning of mul- 
tiple resources located at different tiers results in signifi- 
cant performance gains. 


8 Conclusions 


Resource allocation to applications on the fly is increas- 
ingly desirable in shared data centers with server consol- 
idation. While many techniques for enforcing a known 
allocation exist, dynamically finding the appropriate per- 
resource application quotas has received less attention. 
The challenge is the exponential growth of the search 
space for the optimal solution with the number of ap- 
plications and resources. Hence, exhaustively evaluating 
application performance for all possible configurations 
experimentally is infeasible. 

Our contribution is an effective multi-resource al- 
location technique based on a unified resource-to- 
performance model incorporating i) pre-existing generic 
knowledge about the system and inter-dependencies be- 
tween system resources e.g., due to cache replacement 
policies and ii) application access tracking and baseline 
system metrics captured on-line. 

We show through experiments using several standard 
e-commerce benchmarks and synthetic workloads that 
our performance model is sufficiently accurate in order 
to converge towards a near-optimal global partitioning 
solution within minutes. At the same time, our performance
model effectively optimizes high-level performance
goals, providing improvements of up to factors of 2.9 and 2.4
compared to state-of-the-art single-resource
controllers and their ad-hoc combination, respectively.
Abstract


Rapid adoption of virtualization technologies has led to 
increased utilization of physical resources, which are mul- 
tiplexed among numerous workloads with varying demands 
and importance. Virtualization has also accelerated the de- 
ployment of shared storage systems, which offer many ad- 
vantages in such environments. Effective resource manage- 
ment for shared storage systems is challenging, even in re- 
search systems with complete end-to-end control over all 
system components. Commercially-available storage arrays 
typically offer only limited, proprietary support for control- 
ling service rates, which is insufficient for isolating work- 
loads sharing the same storage volume or LUN. 

To address these issues, we introduce PARDA, a novel 
software system that enforces proportional-share fairness 
among distributed hosts accessing a storage array, without 
assuming any support from the array itself. PARDA uses 
latency measurements to detect overload, and adjusts issue 
queue lengths to provide fairness, similar to aspects of flow 
control in FAST TCP. We present the design and implemen- 
tation of PARDA in the context of VMware ESX Server, 
a hypervisor-based virtualization system, and show how it 
can be used to provide differential quality of service for 
unmodified virtual machines while maintaining high effi- 
ciency. We evaluate the effectiveness of our implementa- 
tion using quantitative experiments, demonstrating that this 
approach is practical. 


1 Introduction 


Storage arrays form the backbone of modern data centers 
by providing consolidated data access to multiple applica- 
tions simultaneously. Deployments of consolidated storage 
using Storage Area Network (SAN) or Network-Attached 
Storage (NAS) hardware are increasing, motivated by easy 
access to data from anywhere at any time, ease of backup, 
flexibility in provisioning, and centralized administration. 
This trend is further fueled by the proliferation of virtualiza- 
tion technologies, which rely on shared storage to support 
features such as live migration of workloads across hosts. 

A typical virtualized data center consists of multi- 
ple physical hosts, each running several virtual machines 
(VMs). Many VMs may compete for access to one or more
logical units (LUNs) on a single storage array. The result- 
ing contention at the array for resources such as controllers, 
caches, and disk arms leads to unpredictable IO comple- 
tion times. Resource management mechanisms and policies 
are required to enable performance isolation, control service 
rates, and enforce service-level agreements. 

In this paper, we target the problem of providing coarse- 
grained fairness to VMs, without assuming any support 
from the storage array itself. We also strive to remain work- 
conserving, so that the array is utilized efficiently. We fo- 
cus on proportionate allocation of IO resources as a flexible 
building block for constructing higher-level policies. This 
problem is challenging for several reasons, including the 
need to treat the array as an unmodifiable black box, unpre- 
dictable array performance, uncertain available bandwidth, 
and the desire for a scalable decentralized solution. 

Many existing approaches [13, 14, 16, 21, 25, 27, 28] al-
locate bandwidth among multiple applications running on
a single host. In such systems, one centralized scheduler 
has complete control over all requests to the storage system. 
Other centralized schemes [19, 30] attempt to control the 
queue length at the device to provide tight latency bounds. 
Although centralized schedulers are useful for host-level IO 
scheduling, in our virtualized environment we need an ap- 
proach for coordinating IO scheduling across multiple inde- 
pendent hosts accessing a shared storage array. 

More decentralized approaches, such as Triage [18], 
have been proposed, but still rely on centralized measure- 
ment and control. A central agent adjusts per-host band- 
width caps over successive time periods and communicates 
them to hosts. Throttling hosts using caps can lead to sub- 
stantial inefficiency by under-utilizing array resources. In 
addition, host-level changes such as VMs becoming idle 
need to propagate to the central controller, which may cause 
a prohibitive increase in communication costs. 

We instead map the problem of distributed storage ac- 
cess from multiple hosts to the problem of flow control in 
networks. In principle, fairly allocating storage bandwidth 
with high utilization is analogous to distributed hosts trying 
to estimate available network bandwidth and consuming it 
in a fair manner. The network is effectively a black box to 
the hosts, providing little or no information about its current 
state and the number of participants. Starting with this loose
analogy, we designed PARDA, a new software system that 
enforces coarse-grained proportional-share fairness among 
hosts accessing a storage array, while still maintaining high 
array utilization. 

PARDA uses the IO latency observed by each host as an 
indicator of load at the array, and uses a control equation 
to adjust the number of IOs issued per host, i.e., the host 
window size. We found that variability in IO latency, due 
to both request characteristics (e.g., degree of sequentiality, 
reads vs. writes, and IO size) and array internals (e.g., re- 
quest scheduling, caching and block placement) could be 
magnified by the independent control loops running at each 
host, resulting in undesirable divergent behavior. 

To handle such variability, we found that using the av- 
erage latency observed across all hosts as an indicator of 
overall load produced stable results. Although this approach 
does require communication between hosts, we need only 
compute a simple average for a single metric, which can 
be accomplished using a lightweight, decentralized aggre- 
gation mechanism. PARDA also handles idle VMs and 
bursty workloads by adapting per-host weights based on 
long-term idling behavior, and by using a local scheduler 
at the host to handle short-term bursts. Integrating with a 
local proportional-share scheduler [10] enables fair end-to- 
end access to VMs in a distributed environment. 

We implemented a complete PARDA prototype in the 
VMware ESX Server hypervisor [24]. For simplicity, we 
assume all hosts use the same PARDA protocol to ensure 
fairness, a reasonable assumption in most virtualized clus- 
ters. Since hosts run compatible hypervisors, PARDA can 
be incorporated into the virtualization layer, and remain 
transparent to the operating systems and applications run- 
ning within VMs. We show that PARDA can maintain 
cluster-level latency close to a specified threshold, provide 
coarse-grained fairness to hosts in proportion to per-host 
weights, and provide end-to-end storage IO isolation to 
VMs or applications while handling diverse workloads. 

The next section presents our system model and goals 
in more detail. Section 3 develops the analogy to network 
flow control, and introduces our core algorithm, along with 
extensions for handling bursty workloads. Storage-specific 
challenges that required extensions beyond network flow 
control are examined in Section 4. Section 5 evaluates our 
implementation using a variety of quantitative experiments. 
Related work is discussed in Section 6, while conclusions
and directions for future work are presented in Section 7. 


2 System Model 


PARDA was designed for distributed systems such as the 
one shown in Figure 1. Multiple hosts access one or more
storage arrays connected over a SAN. Disks in storage ar-
rays are partitioned into RAID groups, which are used to
construct LUNs. Each LUN is visible as a storage device to 
hosts and exports a cluster filesystem for distributed access. 
A VM disk is represented by a file on one of the shared 
LUNs, accessible from multiple hosts. This facilitates mi- 
gration of VMs between hosts, avoiding the need to transfer 
disk state. 

Since each host runs multiple virtual machines, the IO 
traffic issued by a host is the aggregated traffic of all its 
VMs that are currently performing IO. Each host maintains 
a set of pending IOs at the array, represented by an issue 
queue. This queue represents the IOs scheduled by the host 
and currently pending at the array; additional requests may 
be pending at the host, waiting to be issued to the storage 
array. Issue queues are typically per-LUN and have a fixed 
maximum issue queue length¹ (e.g., 64 IOs per LUN).

Figure 1: Storage array accessed by distributed hosts/VMs.


IO requests from multiple hosts compete for shared re- 
sources at the storage array, such as controllers, cache, in- 
terconnects, and disks. As a result, workloads running on 
one host can adversely impact the performance of work- 
loads on other hosts. To support performance isolation, re- 
source management mechanisms are required to specify and 
control service rates under contention. 

Resource allocations are specified by numeric shares,
which are assigned to VMs that consume IO resources.² A
VM is entitled to consume storage array resources propor-
tional to its share allocation, which specifies the relative im-
portance of its IO requests compared to other VMs. The IO
shares associated with a host are simply the sum of the
per-VM shares across all of its VMs. Proportional-
share fairness is defined as providing storage array service
to hosts in proportion to their shares.

In order to motivate the problem of IO scheduling across 
multiple hosts, consider a simple example with four hosts 
running a total of six VMs, all accessing a common shared 
LUN over a SAN. Hosts 1 and 2 each run two Linux VMs 
configured with OLTP workloads using Filebench [20]. 


¹ The terms queue length, queue depth, and queue size are used inter-
changeably in the literature. In this paper, we will also use the term window
size, which is common in the networking literature.

² Shares are alternatively referred to as weights in the literature. Al-
though we use the term VM to be concrete, the same proportional-share
framework can accommodate other abstractions of resource consumers,
such as applications, processes, users, or groups.
Host | VM Types | s_1, s_2 | VM1       | VM2       | T_h
1    | 2×OLTP   | 20, 10   | 823 Ops/s | 413 Ops/s | 1240
2    | 2×OLTP   | 10, 10   | 635 Ops/s | 635 Ops/s | 1250
3    | 1×Micro  | 20       | 710 IOPS  | n/a       | 710
4    | 1×Micro  | 10       | 730 IOPS  | n/a       | 730


Table 1: Local scheduling does not achieve inter-host fairness. 
Four hosts running six VMs without PARDA. Hosts 1 and 2 each 
run two OLTP VMs, and hosts 3 and 4 each run one micro- 
benchmark VM issuing 16 KB random reads. Configured shares 
($s_i$), Filebench operations per second (Ops/s), and IOPS ($T_h$ for
hosts) are respected within each host, but not across hosts.


Hosts 3 and 4 each run a Windows Server 2003 VM with 
Iometer [1], configured to generate 16 KB random reads. 
Table 1 shows that the VMs are configured with different 
share values, entitling them to consume different amounts 
of IO resources. Although a local start-time fair queuing 
(SFQ) scheduler [16] does provide proportionate fairness 
within each individual host, per-host local schedulers alone 
are insufficient to provide isolation and proportionate fair- 
ness across hosts. For example, note that the aggregate 
throughput (in IOPS) for hosts 1 and 2 is quite similar, de- 
spite their different aggregate share allocations. Similarly, 
the Iometer VMs on hosts 3 and 4 achieve almost equal per- 
formance, violating their specified 2 : 1 share ratio. 
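
The mismatch described above can be made concrete with a short calculation. The following Python sketch is only an illustration: it sums the per-VM shares from Table 1 into per-host shares and compares each host's entitled fraction with its fraction of the aggregate throughput. The dictionary layout and the function name `entitled_vs_observed` are hypothetical, and comparing throughput fractions directly with share fractions is a simplification, since the approach in this paper allocates queue slots rather than IOPS.

```python
# Values transcribed from Table 1 (per-VM shares and per-host throughput T_h).
# The data layout and function name are illustrative assumptions.
TABLE_1 = {
    1: {"shares": [20, 10], "throughput": 1240},
    2: {"shares": [10, 10], "throughput": 1250},
    3: {"shares": [20], "throughput": 710},
    4: {"shares": [10], "throughput": 730},
}


def entitled_vs_observed(hosts):
    """Return {host: (entitled_fraction, observed_fraction)}.

    A host's shares are the sum of its per-VM shares; its entitled
    fraction is that sum divided by the cluster-wide total.
    """
    total_shares = sum(sum(h["shares"]) for h in hosts.values())
    total_throughput = sum(h["throughput"] for h in hosts.values())
    result = {}
    for host_id, h in hosts.items():
        entitled = sum(h["shares"]) / total_shares
        observed = h["throughput"] / total_throughput
        result[host_id] = (entitled, observed)
    return result


if __name__ == "__main__":
    for host_id, (entitled, observed) in entitled_vs_observed(TABLE_1).items():
        print(f"host {host_id}: entitled {entitled:.1%}, observed {observed:.1%}")
```

Under this simplified comparison, host 1 receives less than its entitled fraction and host 4 receives more, which is the inter-host unfairness that local schedulers cannot correct on their own.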


Many units of allocation have been proposed for sharing 
IO resources, such as Bytes/s, IOPS, and disk service time. 
Using Bytes/s or IOPS can unfairly penalize workloads with 
large or sequential IOs, since the cost of servicing an IO 
depends on its size and location. Service times are difficult 
to measure for large storage arrays that service hundreds of 
IOs concurrently. 


In our approach, we conceptually partition the array 
queue among hosts in proportion to their shares. Thus two 
hosts with equal shares will have equal queue lengths, but 
may observe different throughput in terms of Bytes/s or 
IOPS. This is due to differences in per-IO cost and schedul- 
ing decisions made within the array, which may process 
requests in the order it deems most efficient to maximize 
aggregate throughput. Conceptually, this effect is similar 
to that encountered when time-multiplexing a CPU among 
various workloads. Although workloads may receive equal 
time slices, they will retire different numbers of instruc- 
tions due to differences in cache locality and instruction- 
level parallelism. The same applies to memory and other 
resources, where equal hardware-level allocations do not 
necessarily imply equal application-level progress. 


Although we focus on issue queue slots as our primary 
fairness metric, each queue slot could alternatively repre- 
sent a fixed-size IO operation (e.g., 16 KB), thereby provid- 
ing throughput fairness expressed in Bytes/s. However, a 
key benefit of managing queue length instead of throughput 
is that it automatically compensates workloads with lower 
per-IO costs at the array by allowing them to issue more
requests. By considering the actual cost of the work per- 
formed by the array, overall efficiency remains higher. 

Since there is no central server or proxy performing IO 
scheduling, and no support for fairness in the array, a per- 
host flow control mechanism is needed to enforce speci- 
fied resource allocations. Ideally, this mechanism should 
achieve the following goals: (1) provide coarse-grained 
proportional-share fairness among hosts, (2) maintain high 
utilization, (3) exhibit low overhead in terms of per-host 
computation and inter-host communication, and (4) control 
the overall latency observed by the hosts in the cluster. 

To meet these goals, the flow control mechanism must 
determine the maximum number of IOs that a host can keep 
pending at the array. A naive method, such as using static 
per-host issue queue lengths proportional to each host’s IO 
shares, may provide reasonable isolation, but would not be 
work-conserving, leading to poor utilization in underloaded 
scenarios. Using larger static issue queues could improve 
utilization, but would increase latency and degrade fairness 
in overloaded scenarios. 

This tradeoff between fairness and utilization suggests 
the need for a more dynamic approach, where issue queue 
lengths are varied based on the current level of contention 
at the array. In general, queue lengths should be increased 
under low contention for work conservation, and decreased 
under high contention for fairness. In an equilibrium state, 
the queue lengths should converge to different values for 
each host based on their share allocations, so that hosts 
achieve proportional fairness in the presence of contention. 


3 IO Resource Management 


In this section we first present the analogy between flow 
control in networks and distributed storage access. We then 
explain our control algorithm for providing host-level fair- 
ness, and discuss VM-level fairness by combining cluster- 
level PARDA flow control with local IO scheduling at hosts. 


3.1 Analogy to TCP 


Our general approach maps the problem of distributed stor- 
age management to flow control in networks. TCP running 
at a host implements flow control based on two signals from 
the network: round trip time (RTT) and packet loss proba- 
bility. RTT is essentially the same as IO request latency 
observed by the IO scheduler, so this signal can be used 
without modification. 

However, there is no useful analog of network packet 
loss in storage systems. While networking applications ex- 
pect dropped packets and handle them using retransmission, 
typical storage applications do not expect dropped IO re-
quests, which are rare enough to be treated as hard failures.

Thus, we use IO latency as our only indicator of con-
gestion at the array. To detect congestion, we must be able
to distinguish underloaded and overloaded states. This is
accomplished by introducing a latency threshold parame-
ter, denoted by $\mathcal{L}$. Observed latencies greater than $\mathcal{L}$
may trigger a reduction in queue length. FAST TCP, a
recently-proposed variant of TCP, uses packet latency in-
stead of packet loss probability, because loss probability
is difficult to estimate accurately in networks with high
bandwidth-delay products [15]. This feature also helps in
high-bandwidth SANs, where packet loss is unlikely and
TCP-like AIMD (additive increase multiplicative decrease)
mechanisms can cause inefficiencies. We use a similar
adaptive approach based on average latency to detect con-
gestion at the array.

Other networking proposals such as RED [9] are based 
on early detection of congestion using information from 
routers, before a packet is lost. In networks, this has the 
added advantage of avoiding retransmissions. However, 
most proposed networking techniques that require router 
support have not been adopted widely, due to overhead and 
complexity concerns; this is analogous to the limited QoS 
support in current storage arrays. 


3.2 PARDA Control Algorithm 


The PARDA algorithm detects overload at the array based 
on average IO latency measured over a fixed time period, 
and adjusts the host’s issue queue length (i.e., window size) 
in response. A separate instance of the PARDA control al- 
gorithm executes on each host. 

There are two main components: latency estimation and 
window size computation. For latency estimation, each host 
maintains an exponentially-weighted moving average of IO 
latency at time t, denoted by L(t), to smooth out short-term 
variations. The weight given to past values is determined
by a smoothing parameter $\alpha \in [0,1]$. Given a new latency
observation $l$,


$$L(t) = (1-\alpha) \times l + \alpha \times L(t-1) \qquad (1)$$


The window size computation uses a control mechanism 
shown to exhibit stable behavior for FAST TCP: 


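$$w(t+1) = (1-\gamma)\, w(t) + \gamma \left( \frac{\mathcal{L}}{L(t)}\, w(t) + \beta \right) \qquad (2)$$

Here $w(t)$ denotes the window size at time $t$, $\gamma \in [0,1]$ is a
smoothing parameter, $\mathcal{L}$ is the system-wide latency thresh-
old, and $\beta$ is a per-host parameter that reflects its IO shares
allocation.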

Whenever the average latency $L > \mathcal{L}$, PARDA decreases
the window size. When the overload subsides and $L < \mathcal{L}$,
PARDA increases the window size. Window size adjust-
ments are based on latency measurements, which indicate
load at the array, as well as per-host $\beta$ values, which specify
relative host IO share allocations.

To avoid extreme behavior from the control algorithm, 
$w(t)$ is bounded by $[w_{min}, w_{max}]$. The lower bound $w_{min}$ pre-
vents starvation for hosts with very few IO shares. The up-
per bound $w_{max}$ avoids very long queues at the array, limit-
ing the latency seen by hosts that start issuing requests after
a period of inactivity. A reasonable upper bound can be 
based on typical queue length values in uncontrolled sys- 
tems, as well as the array configuration and number of hosts. 
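
The per-host control loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the ESX implementation: the class name `PardaHost` and its field names are hypothetical. It covers the moving-average latency estimate, the window update toward (threshold / average latency) × window + β, and the clamp to the window bounds.

```python
from dataclasses import dataclass


@dataclass
class PardaHost:
    """Illustrative per-host controller state; all names are assumptions."""
    beta: float          # per-host parameter reflecting its IO shares
    threshold: float     # system-wide latency threshold, same unit as samples
    alpha: float         # weight given to the previous latency average
    gamma: float         # smoothing parameter for window updates
    w_min: float         # lower bound, prevents starvation
    w_max: float         # upper bound, avoids very long array queues
    window: float        # current window size w(t)
    avg_latency: float = 0.0  # smoothed latency L(t)

    def observe(self, latency: float) -> float:
        """Moving average: L(t) = (1 - alpha) * l + alpha * L(t - 1)."""
        self.avg_latency = (1 - self.alpha) * latency + self.alpha * self.avg_latency
        return self.avg_latency

    def update_window(self) -> float:
        """Window update followed by clamping to [w_min, w_max]."""
        if self.avg_latency <= 0:
            return self.window  # no latency estimate yet; leave window unchanged
        target = (self.threshold / self.avg_latency) * self.window + self.beta
        new_window = (1 - self.gamma) * self.window + self.gamma * target
        self.window = min(self.w_max, max(self.w_min, new_window))
        return self.window


if __name__ == "__main__":
    host = PardaHost(beta=2.0, threshold=30.0, alpha=0.5, gamma=0.8,
                     w_min=4.0, w_max=64.0, window=32.0, avg_latency=30.0)
    host.observe(60.0)           # latency above the threshold
    print(host.update_window())  # window shrinks
```

When the smoothed latency exceeds the threshold, the ratio term falls below one and the window shrinks; when latency drops below the threshold the window grows until it reaches the upper bound.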

The latency threshold $\mathcal{L}$ corresponds to the response
time that is considered acceptable in the system, and the
control algorithm tries to maintain the overall cluster-wide
latency close to this value. Testing confirmed our expecta-
tion that increasing the array queue length beyond a certain
value doesn’t lead to increased throughput. Thus, $\mathcal{L}$ can be
set to a value which is high enough to ensure that a suffi-
ciently large number of requests can always be pending at
the array. We are also exploring automatic techniques for
setting this parameter based on long-term observations of
latency and throughput. Administrators may alternatively
specify $\mathcal{L}$ explicitly, based on cluster-wide requirements,
such as supporting latency-sensitive applications, perhaps
at the cost of under-utilizing the array in some cases.

Finally, $\beta$ is set based on the IO shares associated with
the host, proportional to the sum of its per-VM shares. It
has been shown theoretically in the context of FAST TCP
that the equilibrium window size value for different hosts
will be proportional to their $\beta$ parameters [15].

We highlight two properties of the control equation,
again relying on formal models and proofs from FAST TCP.
First, at equilibrium, the throughput of host $i$ is proportional
to $\beta_i / q_i$, where $\beta_i$ is the per-host allocation parameter, and
$q_i$ is the queuing delay observed by the host. Second, for a
single array with capacity $C$ and latency threshold $\mathcal{L}$, the
window size at equilibrium will be:

$$w_i = \beta_i + \beta_i \frac{C \mathcal{L}}{\sum_{\forall j} \beta_j} \qquad (3)$$

To illustrate the behavior of the control algorithm, we
simulated a simple distributed system consisting of a sin-
gle array and multiple hosts using Yacsim [17]. Each host
runs an instance of the algorithm in a distributed manner,
and the array services requests with latency based on an ex-
ponential distribution with a mean of 1/C. We conducted
a series of experiments with various capacities, workloads,
and parameter values.

To test the algorithm’s adaptability, we experimented
with three hosts using a 1 : 2 : 3 share ratio, $\mathcal{L}$ = 200 ms, and
an array capacity that changes from 400 req/s to 100 req/s
halfway through the experiment. Figure 2 plots the through-
put, window size and average latency observed by the hosts
for a period of 200 seconds. As expected, the control al-
gorithm drives the system to operate close to the desired
latency threshold $\mathcal{L}$. We also used the simulator to verify
that as $\mathcal{L}$ is varied (100 ms, 200 ms and 300 ms), the sys-
tem latencies operate close to $\mathcal{L}$, and that window sizes
increase while maintaining their proportional ratio.

Figure 2: Simulation of three hosts with 1 : 2 : 3 share ratio. Array capacity is reduced from 400 to 100 req/s at t = 100 s.
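
The equilibrium window sizes can be checked numerically with a much simpler model than the Yacsim simulation. The Python sketch below is an illustration under stated assumptions: every host sees the same latency, and that latency is taken to be the total number of outstanding requests divided by the array capacity (a Little's-law approximation that is not part of the original simulation). The function names are hypothetical. Iterating the window update from Section 3.2 under this model converges to $w_i = \beta_i + \beta_i C \mathcal{L} / \sum_j \beta_j$.

```python
def simulate_equilibrium(betas, capacity, threshold, gamma=0.8, steps=500):
    """Iterate the window update for all hosts against a shared array.

    Assumption: latency = (sum of windows) / capacity, identical for
    every host. Capacity is in req/s and threshold is in seconds.
    """
    windows = [1.0] * len(betas)
    for _ in range(steps):
        latency = sum(windows) / capacity
        windows = [
            (1 - gamma) * w + gamma * ((threshold / latency) * w + beta)
            for w, beta in zip(windows, betas)
        ]
    return windows


def predicted_equilibrium(betas, capacity, threshold):
    """Closed form: w_i = beta_i + beta_i * C * threshold / sum(beta_j)."""
    total = sum(betas)
    return [b + b * capacity * threshold / total for b in betas]


if __name__ == "__main__":
    betas = [1.0, 2.0, 3.0]  # 1 : 2 : 3 share ratio
    print(simulate_equilibrium(betas, capacity=400.0, threshold=0.2))
    print(predicted_equilibrium(betas, capacity=400.0, threshold=0.2))
```

In this idealized model the total window settles at $\sum_j \beta_j + C\mathcal{L}$, so the windows keep the 1 : 2 : 3 ratio of the $\beta$ values and rescale when the capacity changes.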


3.3 End-to-End Support


PARDA flow control ensures that each host obtains a fair 
share of storage array capacity proportional to its IO shares. 
However, our ultimate goal for storage resource manage- 
ment is to provide control over service rates for the appli- 
cations running in VMs on each host. We use a fair queu- 
ing mechanism based on SFQ [10] for our host-level sched- 
uler. SFQ implements proportional-sharing of the host’s is- 
sue queue, dividing it among VMs based on their IO shares 
when there is contention for the host-level queue. 


Two key features of the local scheduler are worth noting. 
First, the scheduler doesn’t strictly partition the host-level 
queue among VMs based on their shares, allowing them 
to consume additional slots that are left idle by other VMs 
which didn’t consume their full allocation. This handles 
short-term fluctuations in the VM workloads, and provides
some statistical multiplexing benefits. Second, the sched- 
uler doesn’t switch between VMs after every IO, instead 
scheduling a group of IOs per VM as long as they exhibit 
some spatial locality (within a few MB). These techniques 
have been shown to improve overall IO performance [3, 13]. 


Combining a distributed flow control mechanism with a 
fair local scheduler allows us to provide end-to-end IO al- 
locations to VMs. However, an interesting alternative is to 
apply PARDA flow control at the VM level, using per-VM
latency measurements to control per-VM window sizes di- 
rectly, independent of how VMs are mapped to hosts. This 
approach is appealing, but it also introduces new challenges 
that we are currently investigating. For example, per-VM 
allocations may be very small, requiring new techniques 
to support fractional window sizes, as well as efficient dis-
tributed methods to compensate for short-term burstiness.


3.4 Handling Bursts


A well-known characteristic of many IO workloads is a 
bursty arrival pattern—fluctuating resource demand due to 
device and application characteristics, access locality, and 
other factors. A high degree of burstiness makes it difficult 
to provide low latency and achieve proportionate allocation. 

In our environment, bursty arrivals generally occur at 
two distinct time scales: systematic long-term ON-OFF be-
havior of VMs, and sudden short-term spikes in IO work-
loads. To handle long-term bursts, we modify the $\beta$ value
for a host based on the utilization of queue slots by its resi-
dent VMs. Recall that the host-level parameter $\beta$ is propor-
tional to the sum of shares of all VMs (if $s_i$ are the shares
assigned to VM $i$, then for host $h$, $\beta_h = K \times \sum_i s_i$, where $K$
is a normalization constant).

To adjust $\beta$, we measure the average number of outstand-
ing IOs per VM, $n_k$, and each VM’s share of its host window
size as $w_k$, expressed as:


$$w_k = \frac{s_k}{\sum_i s_i}\, w(t) \qquad (4)$$

If $(n_k < w_k)$, we scale the shares of the VM to be
$s'_k = n_k \times s_k / w_k$ and use this to calculate $\beta$ for the host.


Thus if a VM is not fully utilizing its window size, we re-
duce the $\beta$ value of its host, so other VMs on the same host
do not benefit disproportionately due to the under-utilized 
shares of a colocated idle VM. In general, when one or 
more VMs become idle, the control mechanism will allow 
all hosts (and thus all VMs) to proportionally increase their 
window sizes and exploit the spare capacity. 
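
The long-term adjustment described above can be sketched as a small Python function. This is only an illustration: the function name `effective_beta` and its argument names are hypothetical, and the normalization constant K defaults to 1 purely for readability. Each VM's share of the host window is its share fraction times the window; a VM whose average number of outstanding IOs falls short of that has its shares scaled down by the ratio of the two before the host's β is computed.

```python
def effective_beta(shares, outstanding, host_window, k=1.0):
    """Compute a host's beta after discounting under-utilized VM shares.

    shares[i]      -- configured shares of VM i on this host
    outstanding[i] -- measured average outstanding IOs of VM i
    host_window    -- current host window size w(t)
    k              -- normalization constant (assumed value)
    """
    total_shares = sum(shares)
    scaled = []
    for s, n in zip(shares, outstanding):
        vm_window = s / total_shares * host_window  # VM's slice of the window
        if n < vm_window:
            scaled.append(n * s / vm_window)  # scale shares by utilization
        else:
            scaled.append(s)                  # fully utilized: keep shares
    return k * sum(scaled)


if __name__ == "__main__":
    # Two VMs with shares 20 and 10 on a host whose window is 30 slots.
    print(effective_beta([20, 10], outstanding=[5, 10], host_window=30))
```

In this example the first VM uses only a quarter of its 20-slot slice, so its shares are scaled from 20 to 5 and the host's β falls from 30 to 15 (with K = 1).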

For short-term fluctuations, we use a burst-aware local 
scheduler. This scheduler allows VMs to accumulate a 
bounded number of credits while idle, and then schedule 
requests in bursts once the VM becomes active. This also 
improves overall IO efficiency, since requests from a single 
VM typically exhibit some locality. A number of schedulers 
support bursty allocations [6, 13, 22]. Our implementation
uses SFQ as the local scheduler, but allows a bounded num-
ber of IOs to be batched from each VM instead of switching
among VMs purely based on their SFQ request tags.


4 Storage-Specific Challenges


Storage devices are stateful and their throughput can be 
quite variable, making it challenging to apply the latency- 
based flow control approaches used in networks. Equilib- 
rium may not be reached if different hosts observe very dif- 
ferent latencies during overload. Next we discuss three key 
issues to highlight the differences between storage and net- 
work service times. 


Request Location. It is well known that the latency of a 
request can vary from a fraction of a millisecond to tens of 
milliseconds, based on its location compared to previous re- 
quests, as well as caching policies at the array. Variability in 
seek and rotational delays can cause an order of magnitude 
difference in service times. This makes it difficult to esti- 
mate the baseline IO latency corresponding to the latency 
with no queuing delay. Thus a sudden change in average la- 
tency or in the ratio of current values to the previous average 
may or may not be a signal for overload. Instead, we look 
at average latency values in comparison to a latency thresh-
old $\mathcal{L}$ to predict congestion. The assumption is that laten-
cies observed during congestion will have a large queuing
delay component, outweighing increases due to workload 
changes (e.g., sequential to random). 


Request Type. Write IOs are often returned to the host 
once the block is written in the controller’s NVRAM. Later, 
they are flushed to disk during the destage process. How- 
ever, read IOs may need to go to disk more often. Similarly, 
two requests from a single stream may have widely vary- 
ing latencies if one hits in the cache and the other misses. 
In certain RAID systems [5], writes may take four times 
longer than reads due to parity reads and updates. In gen- 
eral, IOs from a single stream may have widely-varying re- 
sponse times, affecting the latency estimate. Fortunately, a 
moving average over a sufficiently long period can absorb 
such variations and provide a more consistent estimate. 


IO Size. Typical storage IO sizes range from 512 bytes to 
256 KB, or even 1 MB for more recent devices. The estima-
tor needs to be aware of changing IO size in the workload.
This can be done by computing latency per 8 KB instead 
of latency per IO using a linear model with certain fixed 
costs. Size variance is less of an issue in networks since 
most packets are broken into MTU-size chunks (typically 
1500 bytes) before transmission. 

All of these issues essentially boil down to the problem 
of estimating highly-variable latency and using it as an in- 
dicator of array overload. We may need to distinguish be- 
tween latency changes caused by workload versus those due 
to the overload at the array. Some of the variation in IO la- 
tency can be absorbed by long-term averaging, and by con- 
sidering latency per fixed IO size instead of per IO request. 
Also, a sufficiently high baseline latency (the desired oper-
ating point for the control algorithm, $\mathcal{L}$) will be insensitive
to workload-based variations in under-utilized cases.


4.1 Distributed Implementation Issues 


We initially implemented PARDA in a completely dis- 
tributed manner, where each host monitored only its own 
IO latency to calculate L(t) for Equation 2 (referred to as 
local latency estimation). However, despite the use of av- 
eraging, we found that latencies observed at different hosts 
were dependent on block-level placement. 

We experimented with four hosts, each running one Win- 
dows Server 2003 VM configured with a 16 GB data disk 
created as a contiguous file on the shared LUN. Each VM 
also has a separate 4 GB system disk. The storage array 
was an EMC CLARiiON CX3-40 (same hardware setup as 
in Section 5). Each VM executed a 16 KB random read IO 
workload. Running without any control algorithm, we no- 
ticed that the hosts observed average latencies of 40.0, 34.5, 
35.0 and 39.5 ms, respectively. Similarly, the throughputs
observed by the hosts were 780, 910, 920 and 800 IOPS, re-
spectively. Notice that hosts two and three achieved better
IOPS and lower latency, even though all hosts were issuing 
exactly the same IO pattern. 

We verified that this discrepancy is explained by place- 
ment: the VM disks (files) were created and placed in or- 
der on the underlying device/LUN, and the middle two vir- 
tual disks exhibited better performance compared to the two 
outer disks. We then ran the control algorithm with latency
threshold $\mathcal{L} = 30$ ms and equal $\beta$ for all hosts. Figure 3
plots the computed window size, latency and throughput
over a period of time. The discrepancy in latencies observed
across hosts leads to divergence in the system. When hosts
two and three observe latencies smaller than $\mathcal{L}$, they in-
crease their window size, whereas the other two hosts still
see latencies higher than $\mathcal{L}$, causing further window size
decreases. This undesirable positive feedback loop leads to
a persistent performance gap. 

To validate that this effect is due to block placement 
of VM disks and array level scheduling, we repeated the 
same experiment using a single 60 GB shared disk. This 
disk file was opened by all VMs using a “multi-writer” 
mode. Without any control, all hosts observed a through-
put of ~790 IOPS and latency of 39 ms. Next we ran with
PARDA on the shared disk, again using equal $\beta$ and $\mathcal{L} = 30$
ms. Figure 4 shows that the window sizes of all hosts are
reduced, and the cluster-wide latency stays close to 30 ms. 

This led us to conclude that, at least for some disk sub- 
systems, latency observations obtained individually at each 
host for its IOs are a fragile metric that can lead to diver- 
gences. To avoid this problem, we instead implemented a 
robust mechanism that generates a consistent signal for con-
tention in the entire cluster, as discussed in the next section.

Figure 3: Local L(t) Estimation. Separate VM disks cause window size divergence due to block placement and unfair array scheduling.

Figure 4: Local L(t) Estimation. VMs use same shared disk, stabilizing window sizes and providing more uniform throughput and latency.


4.2 Latency Aggregation 


After experimenting with completely decentralized ap- 
proaches and encountering the divergence problem detailed 
above, we implemented a more centralized technique to 
compute cluster-wide latency as a consistent signal. The ag- 
gregation doesn’t need to be very accurate, but it should be 
reasonably consistent across hosts. There are many ways to 
perform this aggregation, including approximations based 
on statistical sampling. We discuss two different techniques 
that we implemented for our prototype. 


Network-Based Aggregation. Each host uses a UDP 
socket to listen for statistics advertised by other hosts. The 
statistics include the average latency and number of IOs per 
LUN. Each host either broadcasts its data on a common sub- 
net, or sends it to every other host individually. This is an 
instance of the general average- and sum-aggregation prob- 
lem for which multicast-based solutions also exist [29]. 


Filesystem-Based Aggregation. Since we are trying to 
control access to a shared filesystem volume (LUN), it is 
convenient to use the same medium to share the latency 
statistics among the hosts. We implement a shared file per 
volume, which can be accessed by multiple hosts simulta- 
neously. Each host owns a single block in the file and peri- 
odically writes its average latency and number of IOs for the 
LUN into that block. Each host reads that file periodically 
using a single large IO and locally computes the cluster-
wide average to use for window size estimation.

In our experiments, we have not observed extremely high
variance across per-host latency values, although this seems 
possible if some workloads are served primarily from the 
storage array’s cache. In any case, we do not anticipate that 
this would affect PARDA stability or convergence. 


5 Experimental Evaluation 


In this section, we present the results from a detailed 
evaluation of PARDA in a real system consisting of up to 
eight hosts accessing a shared storage array. Each host is 
a Dell PowerEdge 2950 server with 2 Intel Xeon 3.0 GHz
dual-core processors, 8 GB of RAM and two QLogic HBAs
connected to an EMC CLARiiON CX3-40 storage array 
over a Fibre Channel SAN. The storage volume is hosted 
on a 10-disk RAID-5 disk group on the array. 

Each host runs the VMware ESX Server hypervisor [24] 
with a local instance of the distributed flow control al- 
gorithm. The aggregation of average latency uses the 
filesystem-based implementation described in Section 4.2, 
with a two-second update period. All PARDA experiments 
used the smoothing parameters α = 0.002 and γ = 0.8.

Our evaluation consists of experiments that examine five 
key questions: (1) How does average latency vary with 
changes in workload? (2) How does average latency vary 
with load at the array? (3) Can the PARDA algorithm adjust 
issue queue lengths based on per-host latencies to provide 
differentiated service? (4) How well can this mechanism 
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handle bursts and idle hosts? (5) Can we provide end- 
to-end IO differentiation using distributed flow control to- 
gether with a local scheduler at each host? 

Our first two experiments determine whether average la- 
tency can be used as a reliable indicator to detect overload 
at the storage array, in the presence of widely-varying work- 
loads. The third explores how effectively our control mod- 
ule can adjust host queue lengths to provide coarse-grained 
fairness. The remaining experiments examine how well 
PARDA can deal with realistic scenarios that include work- 
load fluctuations and idling, to provide end-to-end fairness 
to VMs. Throughout this section, we will provide data using 
a variety of parameter settings to illustrate the adaptability 
and robustness of our algorithm. 


5.1 Latency vs. Workload 


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Figure 5: Latency as a function of IO size and sequentiality. 


We first consider a single host running one VM execut- 
ing different workloads, to examine the variation in average 
latency measured at the host. A Windows Server 2003 VM 
running Iometer [1] is used to generate each workload,
configured to keep 8 IOs pending at all times.

We varied three workload parameters: reads — 0 to 100%, 
IO size — 4, 32, 64, and 128 KB, and sequentiality — 0 
to 100%. For each combination, we measured throughput, 
bandwidth, and the average, min and max latencies. 

Over all settings, the minimum latency was observed for 
the workload consisting of 100% sequential 4 KB reads, 
while the maximum occurred for 100% random 128 KB 
writes. Bandwidth varied from 8 MB/s to 177 MB/s. These 
results show that bandwidth and latency can vary by more 
than a factor of 20 due solely to workload variation. 

Figure 5 plots the average latency (in ms) measured for a 
VM while varying IO size and degree of sequentiality. Due 
to space limitations, plots for other parameters have been 
omitted; additional results and details are available in [11]. 

There are two main observations: (1) the absolute latency
value is not very high for any configuration, and (2)
latency usually increases with IO size, but the slope is small
because transfer time is usually dominated by seek and
rotational delays. This suggests that array overload can be
detected by using a fairly high latency threshold value.


Figure 6: Overall bandwidth and latency observed by multiple
hosts as the number of hosts is increased from 1 to 5.


       Workload       |     Phase 1     |     Phase 2
Size | Read | Random  |  Q |    T |  L  |  Q |    T |  L
16K  |  70% |    60%  | 32 | 1160 | 26  | 16 |  640 | 24
16K  | 100% |   100%  | 32 |  880 | 35  | 32 | 1190 | 27
8K   |  75% |     0%  | 32 | 1280 | 25  | 16 |  890 | 17
8K   |  90% |   100%  | 32 |  900 | 36  | 32 | 1240 | 26

Table 2: Throughput (T IOPS) and latencies (L ms) observed by
four hosts for different workloads and queue lengths (Q).


5.2 Latency vs. Queue Length 


Next we examine how IO latency varies with increases in 
overall load (queue length) at the array. We experimented 
with one to five hosts accessing the same array. Each host 
generates a uniform workload of 16 KB IOs, 67% reads 
and 70% random, keeping 32 IOs outstanding. Figure 6 
shows the aggregate throughput and average latency ob- 
served in the system, with increasing contention at the array. 
Throughput peaks at three hosts, but overall latency contin- 
ues to increase with load. Ideally, we would like to operate 
at the lowest latency where bandwidth is high, in order to 
fully utilize the array without excessive queuing delay. 

For uniform workloads, we also expect a good correla- 
tion between queue size and overall throughput. To verify 
this, we configured seven hosts to access a 400 GB volume 
on a 5-disk RAID-5 disk group. Each host runs one VM 
with an 8 GB virtual disk. We report data for a workload 
of 32 KB IOs with 67% reads, 70% random and 32 IOs
pending. Figure 7 presents results for two different static 
host-level window size settings: (a) 32 for all hosts and (b) 
16 for hosts 5, 6 and 7. 

Figure 7: VM bandwidth and latency observed when queue length Q = 32 for all hosts, and when Q = 16 for some hosts.


We observe that the VMs on the throttled hosts receive
approximately half the throughput (~42 IOPS) compared
to other hosts (~85 IOPS) and their latency (~780 ms)
is doubled compared to others (~360 ms). Their reduced
performance is a direct result of throttling, and the increased
latency arises from the fact that a VM’s IOs were queued at
its host. The device latency measured at the hosts (as
opposed to in the VM, which would include time spent in host
queues) is similar for all hosts in both experiments. The
overall latency decreases when one or more hosts are
throttled, since there is less load on the array. For example, in
the second experiment, the overall average latency changes
from ~470 ms at each host to ~375 ms at each host when
the window size is 16 for hosts 5, 6, and 7.
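
These figures can be sanity-checked with Little's law (average time in system = items in system / completion rate). The sketch below assumes each VM really keeps 32 IOs outstanding and that a throttled host keeps 16 of them at the array, as configured; the function name is illustrative.

```python
def littles_law_latency_ms(outstanding_ios, throughput_iops):
    """Average time in system (ms) = items in system / completion rate."""
    return 1000.0 * outstanding_ios / throughput_iops


# VM-level view: 32 IOs pending per VM, whether queued at the host or at the array.
unthrottled_vm = littles_law_latency_ms(32, 85)    # about 376 ms (reported: ~360 ms)
throttled_vm = littles_law_latency_ms(32, 42)      # about 762 ms (reported: ~780 ms)

# Device-level view on a throttled host: only 16 IOs are at the array at once.
throttled_device = littles_law_latency_ms(16, 42)  # about 381 ms
```

The throttled VM's latency roughly doubles because half of its 32 pending IOs wait in the host queue, while the device-level estimate on a throttled host (about 381 ms) stays close to the unthrottled one (about 376 ms), consistent with the observation that device latency is similar for all hosts.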

We also experimented with four hosts sending different 
workloads to the array while we varied their queue lengths 
in two phases. Table 2 reports the workload description and 
corresponding throughput and latency values observed at 
the hosts. In phase 1, each host has a queue length of 32 
while in phase 2, we lowered the queue length for two of 
the hosts to 16. This experiment demonstrates two impor- 
tant properties. First, overall throughput reduces roughly in 
proportion to queue length. Second, if a host is receiving 
higher throughput at some queue length Q due to its work- 
load being treated preferentially, then even for a smaller 
queue length Q/2, the host still obtains preferential treat- 
ment from the array. This is desirable because overall effi- 
ciency is improved by giving higher throughput to request 
streams that are less expensive for the array to process. 


5.3 PARDA Control Method


In this section, we evaluate PARDA by examining fair- 
ness, latency threshold effects, robustness with non-uniform 
workloads, and adaptation to capacity changes. 


5.3.1 Fairness 


We experimented with identical workloads accessing 16 GB
virtual disks from four hosts with equal β values. This is
similar to the setup that led to divergent behavior in
Figure 3. Using our filesystem-based aggregation, PARDA
converges as desired, even in the presence of different
latency values observed by hosts. Table 3 presents results for
this workload without any control, and with PARDA using
equal shares for each host; plots are omitted due to space
constraints. With PARDA, latencies drop, making the overall
average close to the target ℒ. The aggregate throughput
achieved by all hosts is similar with and without PARDA,
exhibiting good work-conserving behavior. This demonstrates
that the algorithm works correctly in the simple case
of equal shares and uniform workloads.


          |     Uncontrolled     |      PARDA (ℒ = 30 ms)
Host      | IOPS | Latency (ms)  | β | IOPS | Latency (ms)
1         |  780 | 41            | 1 |  730 | 34
2         |  900 | 34            | 1 |  890 | 29
3         |  890 | 35            | 1 |  930 | 29
4         |  790 | 40            | 1 |  800 | 33
Aggregate | 3360 | Avg = 37      |   | 3350 | Avg = 31

Table 3: Fairness with 16 KB random reads from four hosts.


Next, we experimented with a share ratio of 1:1:2:2 
for four hosts, setting ℒ = 25 ms, shown in Figure 8.
PARDA converges on window sizes for hosts 1 and 2 that
are roughly half those for hosts 3 and 4, demonstrating good
fairness. The algorithm also successfully converges
latencies to ℒ. Finally, the per-host throughput levels achieved
while running this uniform workload also roughly match the 
specified share ratio. The remaining differences are due to 
some hosts obtaining better throughput from the array, even 
with the same window size. This reflects the true IO costs 
as seen by the array scheduler; since PARDA operates on 
window sizes, it maintains high efficiency at the array. 
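
The proportionality between window sizes and β can be illustrated with a small simulation. This is only a sketch under loud assumptions: it assumes a FAST-TCP-style update of the form w(t+1) = (1 − γ) w(t) + γ (ℒ / L(t) · w(t) + β), omits latency smoothing and the upper bound on window size, and models the array as a fixed-rate server whose latency follows Little's law. The control equation defined earlier in the paper is authoritative; the function name and the 3000 IOPS array rate are hypothetical.

```python
def simulate_windows(betas, threshold_ms, array_iops, gamma=0.8, steps=500):
    """Iterate the assumed window update and return the final per-host window sizes."""
    windows = [1.0] * len(betas)
    for _ in range(steps):
        # Toy latency model (Little's law): all outstanding IOs share a fixed-rate array.
        latency_ms = 1000.0 * sum(windows) / array_iops
        windows = [
            (1.0 - gamma) * w + gamma * (threshold_ms / latency_ms * w + beta)
            for w, beta in zip(windows, betas)
        ]
    return windows


# Share ratio 1 : 1 : 2 : 2 with a 25 ms threshold, as in the experiment above.
windows = simulate_windows(betas=[1, 1, 2, 2], threshold_ms=25.0, array_iops=3000.0)
final_latency_ms = 1000.0 * sum(windows) / 3000.0
```

In this toy model the fixed point satisfies w = β · L / (L − ℒ) for every host, so the windows settle at 13.5, 13.5, 27 and 27 (a 1 : 1 : 2 : 2 ratio) with a latency of 27 ms, slightly above the threshold.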


5.3.2 Effect of Latency Threshold 


Recall that ℒ is the desired latency value at which the array
provides high throughput but small queuing delay. Since
PARDA tries to operate close to ℒ, an administrator can
control the overall latencies in a cluster, bounding IO times
for latency-sensitive workloads such as OLTP. We investigated
the effect of the threshold setting by running PARDA
with different ℒ values. Six hosts access the array
concurrently, each running a VM with a 16 GB disk performing
16 KB random reads with 32 outstanding IOs.





Host | IOPS | Latency (ms) || Host | IOPS | Latency (ms)
1    |  525 | 59           || 4    |  560 | 57
2    |  570 | 55           || 5    |  430 | 77
3    |  570 | 55           || 6    |  500 | 62

Table 4: Uncontrolled 16 KB random reads from six hosts.


Figure 8: PARDA Fairness. Four hosts each run a 16 KB random read workload with β values of 1 : 1 : 2 : 2. Window sizes allocated by
PARDA are in proportion to β values, and latency is close to the specified threshold ℒ = 25 ms.

Figure 9: PARDA Adaptation. Six hosts each run a 16 KB random read workload, with equal β values and ℒ = 30 ms. VMs are stopped
at t = 145 s, t = 220 s and t = 310 s, and window sizes adapt to reflect available capacity.


We first examine the throughput and latency observed in
the uncontrolled case, presented in Table 4. In Figure 9,
we enable the control algorithm with ℒ = 30 ms and equal
shares, stopping one VM each at times t = 145 s, t = 220 s
and t = 310 s. Comparing the results we can see the effect
of the control algorithm on performance. Without PARDA,
the system achieves a throughput of 3130 IOPS at an
average latency of 60 ms. With ℒ = 30 ms, the system achieves
a throughput of 3150 IOPS, while operating close to the
latency threshold. Other experiments with different threshold
values, such as those shown in Figure 10 (ℒ = 40 ms) and
Figure 12 (ℒ = 25 ms), confirm that PARDA is effective at
maintaining latencies near ℒ.

These results demonstrate that PARDA is able to con- 
trol latencies by throttling IO from hosts. Note the different 
window sizes at which hosts operate for different values of 
ℒ. Figure 9(a) also highlights the adaptation of window
sizes, as more capacity becomes available at the array when
VMs are turned off at various points in the experiment. The 
ability to detect capacity changes through changes in la- 
tency is an important dynamic property of the system. 


5.3.3 Non-Uniform Workloads


To test PARDA and its robustness with mixed workloads,
we ran very different workload patterns at the same time
from our six hosts. Table 5 presents the uncontrolled case.
Next, we enable PARDA with ℒ = 40 ms, and assign
shares in a 2 : 1 : 2 : 1 : 2 : 1 ratio for hosts 1 through 6
respectively, plotted in Figure 10. Window sizes are
differentiated between hosts with different shares. Hosts with more
shares reach a window size of 32 (the upper bound, w_max)
and remain there. Other hosts have window sizes close to
19. The average latency observed by the hosts remains close
to ℒ, as shown in Figure 10(b). The throughput observed
by hosts follows roughly the same pattern as window sizes,
but is not always proportional because of array scheduling
and block placement issues. We saw similar adaptation in
window sizes and latency when we repeated this experiment
using ℒ = 30 ms (plots omitted due to space constraints).


Host | Size | Read | Random | IOPS | Latency (ms)
1    | 4K   | 100% |  100%  |  610 | 51
2    | 8K   |  50% |    0%  |  660 | 48
3    | 8K   | 100% |  100%  |  630 | 50
4    | 8K   |  67% |   60%  |  670 | 47
5    | 16K  | 100% |  100%  |  490 | 65
6    | 16K  |  15% |   10%  |  520 | 60

Table 5: Uncontrolled access by mixed workloads from six hosts.


5.3.4 Capacity Changes 


Storage capacity can change dramatically due to workload
changes or array accesses by uncontrolled hosts external
to PARDA. We have already demonstrated in Section 5.3.2
that our approach is able to absorb any spare capacity that
becomes available. To test the ability of the control
algorithm to handle decreases in capacity, we conducted an
experiment starting with the first five hosts from the previous
experiment. At time t = 230 s, we introduce a sixth host that
is not under PARDA control. This uncontrolled host runs a
Windows Server 2003 VM issuing 16 KB random reads to
a 16 GB virtual disk located on the same LUN as the others.


Figure 10: Non-Uniform Workloads. PARDA control with ℒ = 40 ms. Six hosts run mixed workloads, with β values 2 : 1 : 2 : 1 : 2 : 1.

Figure 11: Capacity Fluctuation. Uncontrolled external host added at t = 230 s. PARDA-controlled hosts converge to new window sizes.


With ℒ = 30 ms and a share ratio of 2 : 2 : 1 : 1 : 1 for
the PARDA-managed hosts, Figure 11 plots the usual
metrics over time. At t = 230 s, the uncontrolled external host
starts, thereby reducing available capacity for the five
controlled hosts. The results indicate that as capacity changes,
the hosts under control adjust their window sizes in
proportion to their shares, and observe latencies close to ℒ.


5.4 End-to-End Control 


We now present an end-to-end test where multiple VMs run 
a mix of realistic workloads with different shares. We use 
Filebench [20], a well-known IO modeling tool, to gener- 
ate an OLTP workload similar to TPC-C. We employ four 
VMs running Filebench, and two generating 16 KB random 
reads. A pair of Filebench VMs are placed on each of two 
hosts, whereas the micro-benchmark VMs occupy one host 
each. This is exactly the same experiment discussed in Sec- 
tion 2; data for the uncontrolled baseline case is presented 
in Table 1. Recall that without PARDA, hosts 1 and 2 obtain
similar throughput even though the overall sum of their VM
shares is different. Table 6 provides setup details and reports
data using PARDA control. Results for the OLTP VMs are
presented as Filebench operations per second (Ops/s).


Host | VM Type   | s1, s2 | β_h | VM1        | VM2       | T_h
1    | 2 x OLTP  | 20, 10 | 6   | 1266 Ops/s | 591 Ops/s | 1857
2    | 2 x OLTP  | 10, 10 | 4   |  681 Ops/s | 673 Ops/s | 1316
3    | 1 x Micro | 20     | 4   |  740 IOPS  | n/a       |  740
4    | 1 x Micro | 10     | 2   |  400 IOPS  | n/a       |  400

Table 6: PARDA end-to-end control for Filebench OLTP and
micro-benchmark VMs issuing 16 KB random reads. Configured
shares (s_i), host weights (β_h), Ops/s for Filebench VMs and IOPS
(T_h for hosts) are respected across hosts. ℒ = 25 ms, w_max = 64.


Figure 12: PARDA End-to-End Control. VM IOPS are proportional to shares. Host window sizes are proportional to overall β values.

Figure 13: Handling Bursts. One OLTP workload on host 1 stops
at t = 140 s and restarts at t = 310 s. The β of host 1 is adjusted
and window sizes are recomputed using the new β value.


We run PARDA (ℒ = 25 ms) with host weights (β_h) set
according to shares of their VMs (β_h = 6 : 4 : 4 : 2 for hosts
1 to 4). The maximum window size w_max is 64 for all hosts.
The OLTP VMs on host 1 receive 1266 and 591 Ops/s,
matching their 2 : 1 share ratio. Similarly, OLTP VMs on
host 2 obtain 681 and 673 Ops/s, close to their 1 : 1 share
ratio. Note that the overall Ops/s for hosts 1 and 2 have a
3 : 2 ratio, which is not possible in an uncontrolled scenario.
Figure 12 plots the window size, latency and throughput
observed by the hosts. We note two key properties: (1) window
sizes are in proportion to the overall β values and (2)
each VM receives throughput in proportion to its shares.
This shows that PARDA provides the strong property of
enforcing VM shares, independent of their placement on
hosts. The local SFQ scheduler divides host-level capacity
across VMs in a fair manner, and together with PARDA, is
able to provide effective end-to-end isolation among VMs.
We also modified one VM workload during the experiment
to test our burst-handling mechanism, which we discuss in
the next section.
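
The mapping from VM shares to host weights, and the split a local proportional-share scheduler would ideally produce, can be sketched as follows. The divisor of 5 shares per unit of β is an assumption inferred from Table 6 (30 shares correspond to a weight of 6); the function names are illustrative and the split ignores any scheduler inefficiency.

```python
from fractions import Fraction


def host_weights(vm_shares_by_host, shares_per_unit=5):
    """Sum each host's VM shares and scale the total to a host weight (beta)."""
    return {
        host: Fraction(sum(shares), shares_per_unit)
        for host, shares in vm_shares_by_host.items()
    }


def expected_vm_split(host_throughput, shares):
    """Divide a host's throughput among its VMs in proportion to their shares."""
    total = sum(shares)
    return [host_throughput * s / total for s in shares]


# VM placement and shares taken from Table 6.
placement = {1: [20, 10], 2: [10, 10], 3: [20], 4: [10]}
betas = host_weights(placement)

# Ideal 2 : 1 split of host 1's 1857 Ops/s between its two OLTP VMs.
host1_split = expected_vm_split(1857, [20, 10])
```

The computed weights are 6, 4, 4 and 2, and the ideal split for host 1 is 1238 and 619 Ops/s, close to the measured 1266 and 591 Ops/s.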


5.5 Handling Bursts 


Earlier we showed that PARDA maintains high utilization 
of the array even when some hosts idle, by allowing other 
hosts to increase their window sizes. However, if one or 
more VMs become idle, the overall β of the host must be
adjusted, so that backlogged VMs on the same host don’t 
obtain an unfair share of the current capacity. Our imple- 
mentation employs the technique described in Section 3.4. 

We experimented with dynamically idling one of the
OLTP VM workloads running on host 1 from the previous
experiment presented in Figure 12. The VM workload is
stopped at t = 140 s and resumed at t = 310 s. Figure 13
shows that the β value for host 1 adapts quickly to the
change in the VM workload. Figure 12(a) shows that the
window size begins to decrease according to the modified
lower value of β = 4 starting from t = 140 s. By t = 300 s,
window sizes have converged to a 1 : 2 ratio, in line with
aggregate host shares. As the OLTP workload becomes active
again, the dynamic increase in the β of host 1 causes its
window size to grow. This demonstrates that PARDA ensures
fairness even in the presence of non-backlogged workloads,
a highly-desirable property for shared storage access.


     |         |        Uncontrolled         |          PARDA
Host | VM Type | OPM  | Avg Lat | T_h, L_h   | β_h | OPM   | Avg Lat
1    | SQL1    | 8799 | 213     | 615, 20.4  | 1   |  6952 | 273
2    | SQL2    | 8484 | 221     | 588, 20.5  | 4   | 12356 | 151

Table 7: Two SQL Server VMs with 1 : 4 share ratio,
running with and without PARDA. Host weights (β_h) and OPM
(orders/min), IOPS (T_h for hosts) and latencies (Avg Lat for database
operations, L_h for hosts, in ms). ℒ = 15 ms, w_max = 32.
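
The β adjustment for idle VMs discussed in Section 5.5 can be illustrated with a deliberately simplified sketch. The actual technique is the one described in Section 3.4, which is not reproduced here; the version below simply counts the shares of VMs that currently have IO to issue. The function name, the boolean activity flags, and the divisor of 5 shares per unit of β are assumptions.

```python
def effective_beta(vm_shares, active, shares_per_unit=5):
    """Host weight counting only the VMs that currently have IO to issue."""
    active_shares = sum(s for s, is_active in zip(vm_shares, active) if is_active)
    return active_shares / shares_per_unit


# Host 1 runs two OLTP VMs with 20 and 10 shares (Table 6).
both_active = effective_beta([20, 10], [True, True])
one_idle = effective_beta([20, 10], [True, False])

# Host weights while one VM on host 1 is idle, normalized to the smallest weight.
betas_during_idle = [one_idle, 4.0, 4.0, 2.0]
ratio_to_host4 = [b / betas_during_idle[-1] for b in betas_during_idle]
```

Under these assumptions the weight of host 1 drops from 6 to 4 when the 10-share VM idles, which matches the lower β = 4 reported above, and hosts 1 to 3 end up with twice the weight of host 4.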


5.6 Enterprise Workloads 


To test PARDA with more realistic enterprise workloads, 
we experimented with two Windows Server 2003 VMs, 
each running a Microsoft SQL Server 2005 Enterprise Edi- 
tion database. Each VM is configured with 4 virtual CPUs, 
6.4 GB of RAM, a 10 GB system disk, a 250 GB database 
disk, and a 50 GB log disk. The database virtual disks are 
hosted on an 800 GB RAID-0 LUN with 6 disks; log vir- 
tual disks are placed on a 100 GB RAID-0 LUN with 10 
disks. We used the Dell DVD store (DS2) database test 
suite, which implements a complete online e-commerce ap- 
plication, to stress the SQL databases [7]. We configured a 
15 ms latency threshold, and ran one VM per host,
assigning shares in a 1 : 4 ratio.

Table 7 reports the parameters and the overall applica- 
tion performance for the two SQL Server VMs. Without 
PARDA, both VMs have similar performance in terms of 
both orders per minute (OPM) and average latency. When 
running with PARDA, the VM with higher shares obtains 
roughly twice the OPM throughput and half the average la- 
tency. The ratio isn’t 1 : 4 because the workloads are not 
always backlogged, and the VM with higher shares can’t 
keep its window completely full. 

Figure 14 plots the window size, latency and throughput
observed by the hosts. As the overall latency decreases,
PARDA is able to assign high window sizes to both hosts.
When latency increases, the window sizes converge to be
approximately proportional to the β values. Figure 15
shows the β values for the hosts while the workload is
running, and highlights the fact that the SQL Server VM on
host 2 cannot always maintain enough pending IOs to fill
its window. This causes the other VM on host 1 to pick up
the slack and benefit from increased IO throughput.


Figure 14: Enterprise Workload. Host window sizes and IOPS for SQL Server VMs are proportional to their overall β values whenever
the array resources are contended. Between t = 300 s and t = 380 s, hosts get larger window sizes since the array is not contended.


6 Related Work 


The research literature contains a large body of work re- 
lated to providing quality of service in both networks and 
storage systems, stretching over several decades. Numerous 
algorithms for network QoS have been proposed, including 
many variants of fair queuing [2, 8, 10]. However, these ap- 
proaches are suitable only in centralized settings where a 
single controller manages all requests for resources. Stoica 
proposed QoS mechanisms based on a stateless core [23], 
where only edge routers need to maintain per-flow state, but 
some minimal support is still required from core routers. 


In the absence of such mechanisms, TCP has been serv- 
ing us quite well for both flow control and congestion avoid- 
ance. Commonly-deployed TCP variants use per-flow in- 
formation such as estimated round trip time and packet loss 
at each host to adapt per-flow window sizes to network con- 
ditions. Other proposed variants [9] require support from 
routers to provide congestion signals, inhibiting adoption. 


FAST-TCP [15] provides a purely latency-based ap- 
proach to improving TCP’s throughput in high bandwidth- 
delay product networks. In this paper we adapt some of 
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the techniques used by TCP and its variants to perform flow 
control in distributed storage systems. In so doing, we have 
addressed some of the challenges that make it non-trivial to 
employ TCP-like solutions for managing storage IO. 


Many storage QoS schemes have also been proposed to 
provide differentiated service to workloads accessing a sin- 
gle disk or storage array [4, 13, 14, 16, 25, 30]. Unfortu- 
nately, these techniques are centralized, and generally re- 
quire full control over all IO. Proportionate bandwidth allo- 
cation algorithms have also been developed for distributed 
storage systems [12,26]. However, these mechanisms were 
designed for brick-based storage, and require each storage 
device to run an instance of the scheduling algorithm. 


Deployments of virtualized systems typically have no 
control over storage array firmware, and don’t use a central 
IO proxy. Most commercial storage arrays offer only lim- 
ited, proprietary quality-of-service controls, and are treated 
as black boxes by the virtualization layer. Triage [18] is 
one control-theoretic approach that has been proposed for 
managing such systems. Triage periodically observes the 
utilization of the system and throttles hosts using band- 
width caps to achieve a specified share of available capacity. 
This technique may underutilize array resources, and relies 
on a central controller to gather statistics, compute an on- 
line system model, and re-assign bandwidth caps to hosts. 
Host-level changes must be communicated to the controller 
to handle bursty workloads. In contrast, PARDA only re- 
quires very light-weight aggregation and per-host measure- 
ment and control to provide fairness with high utilization. 


Friendly VMs [31] propose cooperative fair sharing 
of CPU and memory in virtualized systems leveraging 
feedback-control models. Without relying on a centralized 
controller, each “friendly” VM adapts its own resource con- 
sumption based on congestion signals, such as the relative 
progress of its virtual time compared to elapsed real time, 
using TCP-like AIMD adaptation. PARDA applies similar 
ideas to distributed storage resource management. 
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7 Conclusions 


In this paper, we studied the problem of providing 
coarse-grained fairness to multiple hosts sharing a single 
storage system in a distributed manner. We propose a novel 
software system, PARDA, which uses average latency as an 
indicator for array overload and adjusts per-host issue queue 
lengths in a decentralized manner using flow control. 

Our evaluation of PARDA in a hypervisor shows that it is 
able to provide fair access to the array queue, control over- 
all latency close to a threshold parameter and provide high 
throughput in most cases. Moreover, combined with a local 
scheduler, PARDA is able to provide end-to-end prioritization
of VM IOs, even in the presence of variable workloads.

As future work, we are trying to integrate soft limits and 
reservations to provide a complete IO management frame- 
work. We would also like to investigate applications of 
PARDA to other non-storage systems where resource man- 
agement must be implemented in a distributed fashion. 
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Abstract 


We develop a holistic framework for adaptively schedul- 
ing asynchronous requests in distributed file systems. 
The system is holistic in that it manages all resources, 
including network bandwidth, server I/O, server CPU, 
and client and server memory utilization. It acceler- 
ates, defers, or cancels asynchronous requests in order 
to improve application-perceived performance directly. 
We employ congestion pricing via online auctions to co- 
ordinate the use of system resources by the file system 
clients so that they can detect shortages and adapt their 
resource usage. We implement our modifications in the 
Congestion-Aware Network File System (CA-NFS), an 
extension to the ubiquitous network file system (NFS). 
Our experimental results show that CA-NFS results in
a 20% improvement in execution times when compared 
with NFS for a variety of workloads. 


1 Introduction 


Distributed file system clients consume server and net- 
work resources without consideration for how their op- 
erations interfere with their future requests and other 
clients. Each client request incurs a cost to the sys- 
tem, expressed in increased load to one or more of its 
resources. As more capacity, more workload, or more 
users are added, congestion rises, and all client operations
share the cost in delayed execution. However, clients re- 
main oblivious to the congestion level of the system re- 
sources. 

When the system is under congestion, network file 
servers try to maximize throughput across clients, as- 
suming that their benefit increases with the flow rate. 
This practice does not correspond well with application- 
perceived performance because it fails to distinguish 
the urgency and relative priority of file system opera- 
tions across the client population. From the server’s 
perspective, all client operations at any given time are 
equally important. This is a fallacy. File system opera- 
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tions come at different priorities implicitly. While some 
need to be performed on demand, many can be deferred. 
Synchronous client operations (metadata, reads) bene- 
fit more from timely execution than asynchronous op- 
erations (most writes, read-aheads), because the former 
block the calling application until completion. Also, cer- 
tain asynchronous operations are more urgent than oth- 
ers depending on the client’s state. For example, when a 
client’s memory consumption is high, all of its write op- 
erations become synchronous, leading to a degradation 
in system performance. 

In this paper, we develop a performance management 
framework for distributed file systems that dynamically 
assesses system load, manages system resources, and 
schedules asynchronous client operations. When the sys- 
tem resources approach critical capacity, we apply prior- 
ity scheduling, preferring blocking to non-blocking re- 
quests, and priority inheritance, e.g. performing writes 
that block reads at high priority, so that non-time-critical 
(asynchronous) I/O traffic does not interfere with on- 
demand (synchronous) requests. On the other hand, if 
the system load is low, we perform asynchronous opera- 
tions more aggressively in order to avoid the possibility 
of performing the same operations at a later time, when 
the server resources will be congested. 

The framework is based on a holistic congestion pric- 
ing mechanism that incorporates all critical resources 
among all clients and servers, from client caches to 
server disk subsystems. Holistic goes beyond end-to- 
end in that it balances resource usage across multiple 
clients and servers. (End-to-end also connotes network 
endpoints and holistic management goes from client ap- 
plications to server disk systems.) The holistic approach 
allows the system to address different bottlenecks in dif- 
ferent configurations and respond to changing resource 
limitations over time. 

Servers encode their resource constraints by increas- 
ing or decreasing the price of asynchronous reads and 
writes in the system in order to “push back” at clients. 
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As the server prices increase, the clients that are not re- 
source constrained will defer asynchronous operations 
for a later time and, thus, reduce their presented load. 
This helps to avoid congestion in the network and server 
I/O system caused by non-critical operations. 

The underlying pricing algorithm, based on resource 
utilization, provides a log-k competitive solution to
resource pricing when compared with an offline algorithm
that “knows” all future requests. In contrast to heuristic 
methods for moving thresholds, this approach is system 
and workload independent. 
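
To make the push-back idea concrete, here is a hedged sketch of utilization-based pricing and the resulting client decision. The text above does not give the price function, so the exponential curve, the base k = 32, the comparison against the client's own memory pressure, and both function names are assumptions chosen for illustration, not the CA-NFS implementation.

```python
def price(utilization, k=32.0, p_max=1.0):
    """Exponential price in utilization: 0 when idle, p_max when saturated."""
    u = min(max(utilization, 0.0), 1.0)   # clamp to the valid range [0, 1]
    return p_max * (k ** u - 1.0) / (k - 1.0)


def schedule_async_write(server_utilization, client_memory_utilization):
    """Defer when the server is pricier than the client's own memory pressure."""
    server_price = price(server_utilization)
    client_price = price(client_memory_utilization)
    return "defer" if server_price > client_price else "accelerate"
```

For example, schedule_async_write(0.9, 0.2) defers the write because the server is the scarcer resource, while schedule_async_write(0.1, 0.8) accelerates it to relieve client memory. A convex curve of this kind keeps prices low until a resource nears saturation, which is the usual motivation for exponential pricing in competitive online algorithms.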

We evaluate our proposed changes in CA-NFS 
(Congestion-Aware Network File System), an extension 
of the NFS protocol, implemented as modifications to the 
Linux NFS client, server, and memory manager. Experi- 
mental results show that CA-NFS outperforms NFS and 
improves application-perceived performance by more 
than 20% in a wide variety of workloads. 


2 System Operation 


In this section, we give the intuition behind schedul- 
ing asynchronous operations and the effect these have on 
system resource utilization. We then demonstrate how 
clients adapt their behavior using pricing and auctions. 


2.1 Asynchronous Writes 


The effectiveness of asynchronous write operations 
depends on the client’s current memory state. Writes 
are asynchronous only if there is available memory; a 
system that cannot allocate memory to a write blocks
that write until memory can be freed. This hampers per- 
formance severely because all subsequent writes become 
effectively synchronous. It also has an adverse effect on 
reads. All pending writes that must be written to storage 
interfere with concurrent reads, which results in queuing 
delays at the network and disk. 

CA-NFS changes the way that asynchronous writes 
are performed compared to regular NFS. NFS clients 
write data to the server’s memory immediately upon
receiving a write() system call and also buffer the write
data in local memory. The buffered pages are marked as 
dirty at both the client and the server. To harden these 
data to disk, the client sends a commit message to the 
server. The decision of when to commit the data to the 
server depends on several factors. Traditionally, systems 
used a periodic update policy in which individual dirty 
blocks are flushed when their age reaches a predefined 
limit [32]. Modern systems destage dirty pages when the 
number of dirty pages in memory exceeds a certain
percentage (the flushing point), which is typically a small
fraction of the available memory (e.g., 10%). Then, a daemon
wakes up and starts flushing dirty pages until an adequate 
number of pages have reached stable storage. 

In contrast to regular NFS, CA-NFS clients adapt their
asynchronous write behavior by either deferring or ac- 
celerating a write. CA-NFS clients accelerate writes by 
forcing the CA-NFS server to sync the data to stable stor- 
age so that the client does not need to buffer all of the cor- 
responding dirty pages. The idea behind write accelera- 
tion is that if the server resource utilization is low, there is 
no need to defer the commit to a later time. Also, clients 
may elect to accelerate writes in order to preserve their 
cache contents and maintain a high cache hit rate. Note 
that accelerating a write does not make the write opera- 
tion synchronous. Instead, it invokes the write-back dae- 
mon at the client immediately. 

Write acceleration possibly increases the server disk 
utilization and uses network bandwidth immediately. In 
write-behind systems, many writes are canceled before 
they reach the server [5, 34], e.g., by writing the same file
page repeatedly, or by creating and deleting a temporary file.
Thus, some of the load that write acceleration imposes on
the server could have been avoided by waiting. However,
write acceleration has almost no negative effect on system
performance, because CA-NFS accelerates writes only when
the server load is low.

Deferring a write avoids copying dirty data to server 
memory upon receiving a write request. Instead, clients 
keep data in local memory only, until the price of using 
the server resources is low. Clients price asynchronous 
writes based on their ability to cache writes, i.e. available 
memory. A client with scarce memory, because of write 
deferral, will increase its local price for writes so that its 
buffered pages will be transferred to the server as soon 
as possible. To make write deferral possible, we modify 
the operation of the write-back daemon on the clients by 
dynamically changing the flushing point value based on 
the pricing mechanism to dictate when the write-back of 
dirty pages should begin. 

Deferring a write consumes client memory with dirty 
pages, saves server memory, and delays the consump- 
tion of network bandwidth and server disk I/O. However, 
it faces the risk of imposing higher latency for subse- 
quent synchronous commit operations. This is because 
a file sync may require a network transfer of the dirty 
buffers from the client to server memory. Note that de- 
ferring a write does not guarantee that the server price for 
the same operation will be lower in the future. Instead, 
this policy gives priority to operations originating from 
resource-constrained clients. 

CA-NFS follows NFS’s close-to-open consistency 
model. Deferring or accelerating writes does not vio- 
late the consistency semantics of NFS, because CA-NFS 
does not change the semantics of the COMMIT opera- 
tion. Asynchronous write-back in NFS includes a dead- 
line that, when it elapses, escalates the operation to a syn- 
chronous write. CA-NFS does the same. 

The server prices asynchronous writes based on its
memory, disk and network utilization. If the server mem- 
ory contains blocks that are currently accessed by clients, 
setting high prices forces clients to defer writes in order 
to preserve cache contents and maintain a high cache hit 
rate. Also, if the disk or network resources are heavily 
utilized, CA-NFS defers writes until the load decreases, 
to avoid queuing delays because of pending writes that 
must be written to storage and interfere with concurrent, 
synchronous reads. If the system resources are under- 
utilized, the server encourages clients to flush their dirty 
data by decreasing its price. 


2.2 Asynchronous Reads 


CA-NFS attempts to optimize the scheduling of asyn- 
chronous reads (read-ahead). Servers set the price for 
read-aheads based on the disk and network utilization. 
If the server resources are heavily congested, CA-NFS 
servers are less willing to accept read-ahead operations. 

A client’s willingness to perform read-ahead depends 
on its available memory and the effectiveness of the oper- 
ation. If the server and network resources are congested 
so that the server’s read-ahead price is higher than their 
local price, clients perform read-ahead prudently in fa- 
vor of synchronous operations. Capping the number of 
read-ahead operations saves client memory and delays the
consumption of network bandwidth, but often converts
cache hits into synchronous reads because data were not 
preloaded into the cache. On the other hand, if the server 
price is low, clients perform read-ahead more aggres- 
sively. 


2.3 CA-NFS in Practice


Figure 1 shows the high-level operation of the system
and how the pricing model makes clients adapt their
behavior based on the state of the system. At this time, our
treatment of pricing is qualitative. We describe the de- 
tails of constructing appropriate pricing models in Sec- 
tion 3.3. 

The server sets the price of different operations to 
manage its resources and network utilization in a
coordinated fashion. In this example, the server’s memory is
nearly full and the server is near its maximum rate of I/O
operations per second (IOPS). Based on this, it sets the price
of asynchronous writes relatively high, because they
consume server memory and add IOPS to the system.

CA-NFS allows the system to exchange memory con- 
sumption between the clients and the server. Clients 
adapt their prices based on their local state. Client #1 has 
available memory, so it stops writing dirty data. Client #2 
is nearing its memory bound and, if it runs out of mem- 
ory, applications will block awaiting the completion of 
asynchronous writes. Thus, even though the server price 
of asynchronous writes is high, this client is willing to 
pay in order to avoid exhausting its memory. When the
server clears its memory, it will lower the price of
asynchronous writes and Client #1 will commence writing
again. Servers notify clients about their prices as part
of the CA-NFS protocol.

Figure 1: Overview of Congestion-Aware NFS. Clients and
servers monitor their resource usage from which they derive
prices for the different file system operations. (AW =
asynchronous write, RA = read-ahead, RA eff = read-ahead
effectiveness.)

The criteria for whether to perform read-ahead pru- 
dently or aggressively are similar. Client #1 has lots of 
available memory, a read-dominated workload, and good 
read-ahead effectiveness, so that read-ahead turns most 
future synchronous reads into cache hits. Thus, it is will- 
ing to pay the server’s price and perform more aggressive 
read-ahead. Client #2 has a write-dominated workload, 
little memory, and a relatively ineffective cache. Thus, it 
halts read-ahead requests to conserve resources for other 
tasks. 


3 Pricing Mechanism 


In distributed file systems, resources are heteroge- 
neous and, therefore, no two of them are directly com- 
parable. One cannot balance CPU cycles against mem- 
ory utilization or vice versa. Nor does either resource 
convert naturally into network bandwidth. This makes 
the assessment of the load on a distributed system dif- 
ficult. Previous models [20, 38, 44] designed to manage 
load and avoid throughput crashes via adaptive schedul- 
ing focus on one resource only or rely on high-level ob- 
servations, such as request latency. The price unifica- 
tion model in CA-NFS provides several advantages: (a) 
it takes into account all system resources, (b) it unifies 
congestion across all devices in order to be comparable, 
and (c) it identifies bottlenecks across all clients and the 
server in a collective way. 

Underlying the entire system, we develop a unified al- 
gorithmic framework based on competitive analysis for 
the efficient scheduling of distributed file system opera- 
tions with respect to system resources. We rely on the 
algorithm of Awerbuch et al. [4] for bandwidth sharing
in circuit-sharing networks with permanent connections
that uses an online auction model to price congestion
in a resource-independent way. We adapt this theory
to distributed file systems by considering the path of file 
system operations, from the client’s memory to server’s 
disk, as a short-lived circuit. 


CA-NFS uses a reverse auction model. In a reverse 
auction, the buyer advertises a need for a service and the 
sellers place bids, like a regular auction. However, the 
seller who places the lowest bid wins the auction. Ac- 
cordingly in CA-NFS, when the client is about to issue a 
request, it compares its local price with the server price. 
Depending on which price is lower, the client either
accelerates or defers the operation.
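The comparison above can be sketched in a few lines of Python. This is only an illustration of the decision rule as described in Section 2: the names `Action` and `schedule_async_write` are hypothetical, and the tie-breaking choice (defer when the two prices are equal) is an assumption rather than something the text specifies.

```python
from enum import Enum


class Action(Enum):
    ACCELERATE = "accelerate"  # push the dirty data to the server now
    DEFER = "defer"            # keep the dirty data in client memory for now


def schedule_async_write(client_price: float, server_price: float) -> Action:
    """Choose the cheaper place to hold dirty data (illustrative rule only)."""
    # The server "wins" the reverse auction when its price is lower than
    # the client's local price. Assumption: ties are resolved by deferring.
    if server_price < client_price:
        return Action.ACCELERATE
    return Action.DEFER


if __name__ == "__main__":
    # A memory-starved client (high local price) still sends data to a
    # moderately loaded server; a client with free memory holds back.
    print(schedule_async_write(client_price=0.9, server_price=0.6))
    print(schedule_async_write(client_price=0.1, server_price=0.6))
```

In this reading, a client that is short on memory accelerates even when the server is fairly expensive, which matches the behavior of Client #2 in the example of Section 2.3.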


We start by describing an auction for a single resource. 
We then build a pricing function for each resource and 
assemble these functions into a price for each NFS oper- 
ation. 


3.1 Algorithmic Foundation 


For each resource, we define a simple auction in an 
online setting in which the bids arrive sequentially and 
unpredictably. In a way, a bid represents the client’s will- 
ingness to pay for the use of the resource, i.e. the client’s 
local price. A bid will be accepted immediately if it is 
higher than the price of the resource at that time. 


Our goal is to find an online algorithm that is competitive
with the optimal offline algorithm for any future request
sequence. The performance degradation of an online
algorithm (its competitive ratio) is
$r = \max(B_{\text{offline}} / B_{\text{online}})$, in which
$B_{\text{offline}}$ is the benefit from the optimal offline
algorithm and $B_{\text{online}}$ is the benefit from the
online algorithm. Awerbuch et al. [4] establish the lower
bound at $\Omega(\log k)$, in which $k$ is the ratio
between the maximum and minimum benefit realized by
the online algorithm over all inputs. The lower bound is
achieved when reserving $1/\log k$ of the resource doubles
the price.


The worst case occurs when the offline algorithm sells 
the entire resource at the maximum bid P, which was re- 
jected by the online algorithm. For the online algorithm 
to reject this bid, it must have set the price greater than P, 
which means it has already sold 1/ log k of the resource 
for at least P/2. 


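$$
B_{\text{online}} \ge \frac{P}{2\log k}
\quad\text{and}\quad
B_{\text{offline}} - B_{\text{online}} \le P
\quad\Longrightarrow\quad
r \le 1 + 2\log k
$$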
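
The bound on the competitive ratio follows from the two inequalities of the worst case, namely that the online benefit is at least $P/(2\log k)$ and that the offline algorithm gains at most $P$ more than the online one. Spelled out as an intermediate step (our own expansion of the argument, using the same symbols):

```latex
r \;=\; \frac{B_{\text{offline}}}{B_{\text{online}}}
  \;=\; 1 + \frac{B_{\text{offline}} - B_{\text{online}}}{B_{\text{online}}}
  \;\le\; 1 + \frac{P}{P / (2\log k)}
  \;=\; 1 + 2\log k .
```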


Increasing price exponentially with increased utilization 
leads to a competitive ratio logarithmic in k. 
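
To see why an exponential price curve fits the doubling condition, consider, purely as an illustration, a price proportional to $k^{u}$ with logarithms taken base 2. The base price $p_0$ is a hypothetical symbol introduced only for this example.

```latex
p(u) = p_0\,k^{u}
\quad\Longrightarrow\quad
\frac{p\!\left(u + \tfrac{1}{\log_2 k}\right)}{p(u)}
  \;=\; k^{1/\log_2 k}
  \;=\; 2^{\log_2 k / \log_2 k}
  \;=\; 2 ,
\qquad
\frac{p(1)}{p(0)} = k .
```

Under this assumption, each additional $1/\log_2 k$ of utilization doubles the price, and the price spans a factor of $k$ between an idle and a fully utilized resource.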


3.2 A Practical Pricing Function


This model gives us an online strategy that is prov- 
ably competitive with the optimal offline algorithm in the 
maximum usage of each resource. It has a weak (log, not 
constant) competitive ratio, but even this weak ratio is 
unprecedented in the storage systems literature. The
online algorithm knows nothing about the future, assumes
no correlation between past and future requests, and is 
only aware of the current system state. 

Based on the theoretical framework, we define the
pricing function $P_i$ for an individual resource $i$ in our
framework as

$$
P_i(u_i) = P_{\max} \cdot \frac{k_i^{u_i} - 1}{k_i - 1}
$$

in which the utilization $u_i$ varies between 0 and 1 so that
the price varies between 0 and $P_{\max}$.
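
A minimal Python sketch of this pricing function follows. It assumes the formula reads $P_{\max}(k^{u} - 1)/(k - 1)$ and that $k > 1$; the function name `resource_price` and the default `p_max = 1.0` (the value the evaluation reports for the maximum price) are choices made for this example.

```python
def resource_price(utilization: float, k: float, p_max: float = 1.0) -> float:
    """Price of one resource as an exponential function of its utilization.

    Assumed form: p_max * (k**u - 1) / (k - 1), with 0 <= u <= 1 and k > 1.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must lie in [0, 1]")
    if k <= 1.0:
        # k == 1 would divide by zero; this sketch only handles k > 1.
        raise ValueError("k must be greater than 1")
    return p_max * (k ** utilization - 1.0) / (k - 1.0)


if __name__ == "__main__":
    # The price is 0 for an idle resource and p_max for a saturated one.
    print(resource_price(0.0, k=16.0))
    print(resource_price(0.5, k=16.0))
    print(resource_price(1.0, k=16.0))
```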

The parameter k represents the performance degradation
experienced by the end user as the resource becomes
congested. Thus, appropriate values of k should provide 
incremental feedback as the resource usage increases. 

The heterogeneous resources of distributed file sys- 
tems complicate parameter selection. Different resources 
become congested at different levels of utilization, which 
dictates that parameters need to be set individually. With 
very large k, the price function stays near zero until the 
utilization is almost 1. Then the price goes up very 
quickly. With very small k, the resource becomes ex- 
pensive at lower utilization, which throttles usage prior 
to congestion. The network exhibits few negative ef- 
fects from increased utilization until near its capacity 
and, thus, calls for a higher setting of k. Similarly, mem- 
ory works well until it’s nearly full at which point it expe- 
riences congestion in the form of fragmentation and syn- 
chronous stalls from out-of-memory conditions. Disks, 
on the other hand, require smaller values of k, because 
each additional I/O interferes with all subsequent (and 
some previous) I/Os, increasing the service time by in- 
creasing queue lengths and potentially moving the head 
out of position. 
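
The effect of k on the shape of the price curve can be illustrated numerically. The sketch below assumes the price has the form $P_{\max}(k^{u} - 1)/(k - 1)$; the two values of k are hypothetical and were chosen only to make the contrast visible, not taken from the CA-NFS configuration.

```python
def resource_price(utilization: float, k: float, p_max: float = 1.0) -> float:
    """Assumed price curve: p_max * (k**u - 1) / (k - 1), for k > 1."""
    return p_max * (k ** utilization - 1.0) / (k - 1.0)


# Hypothetical settings, for illustration only.
K_DISK = 4.0           # small k: price rises early, throttling before congestion
K_NETWORK = 10_000.0   # large k: price stays near zero until close to saturation

if __name__ == "__main__":
    for u in (0.25, 0.50, 0.75, 0.95):
        disk = resource_price(u, K_DISK)
        net = resource_price(u, K_NETWORK)
        print(f"u={u:.2f}  disk-like price={disk:.3f}  network-like price={net:.3f}")
```

At half utilization the small-k curve is already at one third of the maximum price, while the large-k curve is still below one percent of it.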

CA-NFS users do not need to set the value of k explicitly,
as it is precomputed for most existing device types.
The pricing mechanism is robust to small hardware variations,
e.g., to different device brands. During various CA-NFS
deployments, we experimented extensively with the
value of k. (We do not present all these experiments as
they are quite tedious.)

We approximate the cumulative cost of all resources 
by the highest cost (most congested) resource. The highest
cost resource corresponds well with the system bottleneck.
$P_{\max}$ is the same for all server resources and
the exponential nature of the pricing functions ensures
that resources under load become expensive quickly.
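
Taking the most expensive resource as the system price is a one-line reduction. The sketch below is illustrative; the function name `system_price` and the resource labels are hypothetical.

```python
from typing import Dict, Tuple


def system_price(resource_prices: Dict[str, float]) -> Tuple[str, float]:
    """Return the bottleneck resource and its price (the maximum over all)."""
    if not resource_prices:
        raise ValueError("at least one resource price is required")
    bottleneck = max(resource_prices, key=resource_prices.get)
    return bottleneck, resource_prices[bottleneck]


if __name__ == "__main__":
    prices = {"cpu": 0.10, "network": 0.30, "disk": 0.70, "memory": 0.25}
    name, price = system_price(prices)
    print(f"bottleneck={name} price={price}")
```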

In order to avoid the effects of over-tuning and enforce
stability, we set two additional constraints on the cost
function. Clients assign an infinitesimally higher value
to the maximum price for their resources ($P_{\max} + \epsilon$) than
do servers. This ensures that when both the client and the
server are overloaded, the client sends the operations to
the server. In practice, servers deal with overload more
gracefully than do clients. Also, the client’s prices are
always higher than a minimum price $P_{\min}$, so that if
neither the client nor the server is congested, operations are
performed at the server.
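
The two constraints can be sketched as follows. The concrete values of `EPSILON` and `P_MIN` are hypothetical, the exponential price form is assumed, and clamping the client price from below is one possible way to keep it above the minimum price.

```python
P_MAX = 1.0      # maximum server price (the value used in the evaluation)
EPSILON = 1e-6   # hypothetical: small extra headroom on the client side
P_MIN = 0.01     # hypothetical: floor on the client price


def server_price(utilization: float, k: float) -> float:
    """Assumed server-side price: P_MAX * (k**u - 1) / (k - 1), k > 1."""
    return P_MAX * (k ** utilization - 1.0) / (k - 1.0)


def client_price(utilization: float, k: float) -> float:
    """Client-side price with a slightly higher ceiling and a floor."""
    raw = (P_MAX + EPSILON) * (k ** utilization - 1.0) / (k - 1.0)
    return max(raw, P_MIN)


if __name__ == "__main__":
    # Both overloaded: the client price is marginally higher, so work goes to the server.
    print(client_price(1.0, 16.0) > server_price(1.0, 16.0))
    # Both idle: the floor keeps the client price above the server's zero price.
    print(client_price(0.0, 16.0) > server_price(0.0, 16.0))
```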


3.3 Calculating Resource Utilization


The theoretical model does not make any explicit as- 
sumptions about the type of resources managed. As a 
result, adding new resources to the system is straight- 
forward. We currently monitor the effective usage of five 
resources, each with its own intricacies: 


Server CPU: It is straightforward to establish the uti- 
lization of the CPU accurately at any given time through 
system monitoring. 


Client and Server Network: The utilization of net- 
works is also well defined. However, network bandwidth 
needs to be time-averaged to stabilize the auction. Without
averaging, networks fluctuate between utilization 0
when idle and 1 when sending a message. The price
would be similarly extreme and erratic. Thus, we monitor
the average network bandwidth over a few hundred
milliseconds.
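
One way to time-average network utilization is a sliding window over recent transfers. The sketch below is an assumption about how such averaging might look, not the CA-NFS kernel code; the class name `WindowedUtilization`, the explicit timestamps, and the 300 ms default window are all hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class WindowedUtilization:
    """Average link utilization over a sliding time window (illustrative)."""

    def __init__(self, capacity_bytes_per_sec: float, window_sec: float = 0.3):
        self.capacity = capacity_bytes_per_sec
        self.window = window_sec
        self.samples: Deque[Tuple[float, int]] = deque()  # (timestamp, bytes)

    def _evict(self, now: float) -> None:
        # Drop transfers that fall outside the averaging window.
        while self.samples and self.samples[0][0] <= now - self.window:
            self.samples.popleft()

    def record(self, now: float, nbytes: int) -> None:
        self._evict(now)
        self.samples.append((now, nbytes))

    def utilization(self, now: float) -> float:
        self._evict(now)
        sent = sum(nbytes for _, nbytes in self.samples)
        return min(sent / (self.capacity * self.window), 1.0)


if __name__ == "__main__":
    link = WindowedUtilization(capacity_bytes_per_sec=1000.0, window_sec=1.0)
    link.record(0.0, 200)
    link.record(0.5, 300)
    print(link.utilization(0.9))  # both transfers are inside the window
    print(link.utilization(1.2))  # the first transfer has aged out
```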


Server Disk: Measuring disk utilization is difficult be- 
cause of irregular response times. Although observed 
throughput seems a natural way to represent utilization, 
it is not practical because it depends heavily on the work- 
load. A sequential workload experiences higher through- 
put than a random set of requests. However, disk utiliza- 
tion may be higher in the latter case, because the disk 
spends head time seeking among the random requests. 

We measure disk utilization by sampling the length of 
the device’s dispatch queue at regular, small time inter- 
vals. The maximum disk utilization depends on the sys- 
tem configuration. We do not identify the locality among 
pending operations nor do we use device-specific
information. Recently, Fahrrad [36] and Zygaria [21] showed
the effectiveness of measuring disk utilization by exam- 
ining the disk head time. We plan to evaluate this ap- 
proach in future work. 
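
A minimal sketch of queue-length sampling follows. Normalizing the mean queue length by a configured maximum depth is an assumption made here to map the samples onto the range 0 to 1; the text only states that the maximum depends on the system configuration. The function name and parameters are hypothetical.

```python
from typing import Sequence


def disk_utilization(queue_samples: Sequence[int], max_queue_depth: int) -> float:
    """Estimate disk utilization from sampled dispatch-queue lengths.

    Assumption: utilization is the mean sampled queue length divided by a
    configuration-dependent maximum depth, capped at 1.0.
    """
    if max_queue_depth <= 0:
        raise ValueError("max_queue_depth must be positive")
    if not queue_samples:
        return 0.0  # no samples yet: treat the disk as idle
    mean_length = sum(queue_samples) / len(queue_samples)
    return min(mean_length / max_queue_depth, 1.0)


if __name__ == "__main__":
    # Four samples taken at regular intervals, with a hypothetical depth of 16.
    print(disk_utilization([0, 4, 8, 4], max_queue_depth=16))
```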


Client and Server Memory: Pricing memory consumption
is exceedingly difficult, because memory is a single
resource used by many applications for many purposes:
caching for reuse, dirty buffered pages, and read-ahead.
A cache must preserve a useful population of read-cache
pages. Deferring writes in CA-NFS could reserve more
memory pages to buffer writes, which may in turn reduce
cache hit rates. To avoid this, we identify the portion
of RAM that is actively used to cache read data and
the effectiveness of that cache. We then use pricing to 
preserve that portion of memory in order to maintain 
cache hit rates. The price of memory increases if the ex- 
isting set of pages yields a high cache hit rate or there are 
a large number of dirty pages that have triggered write- 
back. 

Previous research [6] allows us to effectively track the 
utility of read cache pages through the use of two ghost 
caches. We introduce a virtual resource that uses the
distribution of read requests among the ghost caches to
calculate the projected cache hit rate and, thus, the
effective memory utilization. A large fraction
of read requests falling in these regions indicates that the 
client would benefit from more read caching, so defer- 
ring writes is not of particular benefit. 


Client Read-Ahead Effectiveness: We define a virtual 
resource that captures the expected efficiency of
read-ahead [24, 37]. We build our metric of read-ahead
confidence on the adaptive read-ahead logic recently
introduced in the Linux kernel [12]. We define confidence as
the ratio of accesses to read-ahead pages to the
total number of pages accessed for a specific file. For
high values, the system performs read-ahead more ag- 
gressively. For low values, the kernel will be more reluc- 
tant to do the next read-ahead. 
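
The confidence metric is a simple per-file ratio. In the sketch below the function name, the counter names, and the choice to report zero confidence for a file with no accesses yet are assumptions made for illustration.

```python
def readahead_confidence(readahead_page_accesses: int, total_page_accesses: int) -> float:
    """Fraction of a file's page accesses that hit pages brought in by read-ahead."""
    if readahead_page_accesses < 0 or total_page_accesses < 0:
        raise ValueError("counters must be non-negative")
    if readahead_page_accesses > total_page_accesses:
        raise ValueError("read-ahead accesses cannot exceed total accesses")
    if total_page_accesses == 0:
        return 0.0  # assumption: no history means no confidence yet
    return readahead_page_accesses / total_page_accesses


if __name__ == "__main__":
    # 30 of 40 accessed pages had been prefetched: read-ahead is paying off.
    print(readahead_confidence(30, 40))
```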


3.4 CA-NFS Implementation 


We have implemented CA-NFS by modifying the ex- 
isting Linux NFS client and server in the 2.6.18 kernel. 
Specifically, we added support for the exchange of pric- 
ing information and we changed the NFS write operation 
to add support for acceleration and deferral. We have 
also made modifications to the Linux memory manager 
to support the classification of the memory accesses and 
the read-ahead heuristics. 

The CA-NFS server advertises cost information to 
clients, which implement the scheduling logic. We have 
overridden the FSSTAT protocol operation (NFSv3) to 
include pricing information about server resources. Nor- 
mally, FSSTAT retrieves volatile file system state infor- 
mation, such as the total size of the file system or the 
amount of free space. Upon a client’s FSSTAT request, 
the server encodes the prices of operations based on its 
monitored resource usage. In our implementation, the 
server computes the statistics of the resource utilization 
and updates its local cost information once per second.
FSSTAT is a lightweight operation that adds practically 
no overhead to the system resources. Clients do not block 
waiting for the operation to complete. 

Clients send an FSSTAT request to the server every 
ten READ or WRITE requests or when the time interval 
from the previous query is more than ten seconds. As
part of CA-NFS extensions, we intend to have the server
notify active clients via callbacks when its resource
usage increases sharply.

Figure 2: Time to copy a 4GB directory over NFS and CA-NFS
and a breakdown of CA-NFS savings
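
The client-side refresh policy for prices (one FSSTAT query every ten READ or WRITE requests, or once more than ten seconds have passed since the previous query) can be sketched as a small counter. The class name `PricePoller` is hypothetical, and checking the time condition only when a request is issued is a simplification; an actual client might use a timer instead.

```python
class PricePoller:
    """Decide when a client should refresh server prices (illustrative)."""

    REQUESTS_PER_POLL = 10     # one query every ten READ/WRITE requests
    MAX_INTERVAL_SEC = 10.0    # or when the last query is older than this

    def __init__(self, now: float = 0.0):
        self.requests_since_poll = 0
        self.last_poll = now

    def on_request(self, now: float) -> bool:
        """Call once per READ/WRITE; returns True if a price query is due."""
        self.requests_since_poll += 1
        due = (self.requests_since_poll >= self.REQUESTS_PER_POLL
               or now - self.last_poll > self.MAX_INTERVAL_SEC)
        if due:
            self.requests_since_poll = 0
            self.last_poll = now
        return due


if __name__ == "__main__":
    poller = PricePoller(now=0.0)
    decisions = [poller.on_request(0.1 * i) for i in range(1, 11)]
    print(decisions)  # only the tenth request triggers a query
```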


4 Evaluation 


We run experiments on a cluster of twenty-four ma- 
chines running at 3.2GHz with 2GB of RAM each. One 
machine has 4GB of RAM and acts as the server. All 
nodes are connected via Gigabit Ethernet. To compare 
CA-NFS with NFSv3, we run a set of micro-benchmarks 
and application workloads based on the different profiles 
available in filebench [25], Sun’s filesystem benchmark, 
and IOzone [18].


4.1 Microbenchmarks 


We start our analysis with a simple filebench experi- 
ment. A single thread of just one client copies a large 
directory of 4GB over CA-NFS and NFS. This workload 
creates a hierarchical directory tree, then measures the 
rate at which files can be copied from the source tree 
to the new tree. The sizes of the files in the directory 
vary from 1KB to 200MB. Even in such a simple configuration,
CA-NFS provides a 15% improvement in performance,
measured by completion time (Figure 2).

Regular NFS clients fail to use their local memory to 
good effect even though it is not congested. NFS clients 
read data from the server and start buffering write pages 
until they reach the statically defined limit of dirty pages. 
Then, the flushing daemon forces the pages to be writ- 
ten to the server. This requires the server to harden data 
to disk. The resulting write traffic delays disk read re- 
quests. In contrast, CA-NFS clients determine that the 
server disk is heavily utilized through the exchange of 
pricing information. CA-NFS clients use a much larger 
portion of their RAM to buffer dirty pages, avoiding the 
large, asynchronous writes to the server that interfere 
with reads. The effects of read-ahead optimizations are 
less dramatic, but still important. Read-aheads are issued
aggressively in the beginning, because there is free memory
space and they yield a high hit rate for this mostly
sequential workload. As the server memory resources
become more congested with dirty pages, client-initiated
read-aheads are performed more prudently. Figure 2 also
breaks out the portion of the improvement attributed to
write (12%) and read-ahead optimizations (3%).

Figure 3: Application execution time and cache hit rates when
accelerating writes


4.1.1 Operation Scheduling 


Accelerating writes: The next experiment combines 
two IOzone workloads to show how CA-NFS preserves
cache hit rates by valuing client memory highly. We
consider a client application that writes a 2GB file
sequentially. On the same client, another application performs
re-reads, i.e. reads that will be server cache hits if the 
system does not evict the pages. 

Figure 3 shows the execution times of the two appli- 
cations for NFS and CA-NFS. CA-NFS improves read 
performance by 21% when compared with NFS. The 
NFS client evicts memory pages used for read caching 
in order to buffer writes. This reduces the cache hit rate 
and application-perceived read performance as a conse- 
quence. NFS clients replace approximately 15% of the 
pages used for caching and realize a cache hit rate of only 
70%. In contrast, CA-NFS accelerates writes by flush- 
ing them immediately, anticipating the importance of the 
cache contents. CA-NFS clients maintain a cache hit rate 
of 90%. The client prefers to accelerate all asynchronous 
writes, because its read cache is producing a high hit rate, 
thus its price for asynchronous writes is high. The server 
price for asynchronous writes is low, because none of its 
resources is congested. 


Deferring writes: We now demonstrate how CA-NFS 
uses write-buffering at the client to avoid I/O interference 
at the server. One client issues random reads that are ser- 
viced by the server’s disk. Another client writes a 1GB 
file to the server. The NFS client sends the write requests 
to the server, which flushes them to stable storage. These
disk writes increase the service time of disk reads, because
they interfere at the disk with read requests coming
from the first client. Through pricing, CA-NFS identifies
congestion at the disk, which causes the writing client
to buffer dirty data and reduces the amount of write data
delivered to the server. Figure 4 shows that CA-NFS
improves read performance by 18%. In this case, write
performance is also improved by 6%.

Figure 4: Execution time for two clients reading and writing to
a file when deferring writes


4.1.2 On the Pricing Metric 


We characterize how the pricing function captures
system dynamics by comparing resource utilization and
resource price side by side. We show that pricing reflects
congestion on heterogeneous resources, i.e., on the network,
in memory, and on the disk. Prices create a single view
of system load in a resource-independent manner.

For the network resource, we run a network-intensive
workload with four clients reading a 1GB file from
the server’s memory at a rate of 50MB/sec each over a GbE
network. Three clients suffice to saturate the network
bandwidth. We start the clients at 10-second intervals
in order to provide incremental load to the system.
Figure 5(a) plots the server-perceived throughput
and the average throughput at the clients. Figure 5(b) 
shows the server’s system price, governed by the network,
at the same time scale. As the system load increases,
each client gets a smaller share of the bandwidth
and average client throughput drops. Over time, the network
price increases to near its maximum value (1.0, the
value of $P_{\max}$ in all experiments). The increase in price
causes the clients to back off, preventing overload, and 
the server throughput remains stable under heavy load. 

We run a similar experiment for memory-bound work- 
loads and memory price. A client issues reads by increas- 
ing the number of requests for already accessed data. By 
recycling the client’s memory, we force all re-read re- 
quests to be serviced out of the server cache only. From 
the server’s perspective, as the hit ratio increases cache 
data become more important, so it increases the price for
memory operations (Figures 5(c) and 5(d)).

Figure 6: CA-NFS throughput sensitivity to the correct
parameterization of k

For disk-bound workloads, we run a client process 
that issues random read requests over increasingly larger 
spans of the disk. As we increase the span, client 
throughput drops from increased disk head utilization 
that leads to more requests in the disk dispatch queue 
(Figure 5(e)). In response to the increase in the number 
of pending requests, the system increases the price for 
the disk resource (Figure 5(f)). 

In the next experiment, we show how the selection of 
parameter k affects system performance. We run a
read-write IOzone workload on two clients accessing a 2GB
file on the server. Through measurements, we have established
a value of k for each device type, which yields the
best throughput for this experiment. We alter the value
of k for the client and server memory, the network and
the disk resources, and we examine the drop in system 
throughput. 

Figure 6 shows that small perturbations of k do not af- 
fect CA-NFS performance. However, if the value of k 
differs significantly from its optimal (as calculated) set- 
ting, performance degradation is notable. Low values of 
k lead to underutilization of the system resources, while
high values make the system less adaptive, as prices
increase very rapidly. Figure 6 also shows that the disk
and the network are more sensitive to correct
parameterization. This is because these resources exhibit very
high (disk) or very low (network) interference between
past and future requests. As already mentioned, CA-NFS 
users do not have to set the value of k explicitly. In the 
next set of experiments, we show that the CA-NFS pa- 
rameter selection is robust to different workload types. 


4.2 Application Benchmarks 

Microbenchmark experiments demonstrate the opera- 
tion of CA-NFS by isolating the benefits of individual 
optimizations. To better understand how CA-NFS affects
applications, we turn our attention to macrobenchmarks. 

Figure 5: Examining the pricing mechanism for three different resources (network (a,b), memory (c,d), and disk (e,f))

Figure 7: Average number of ops/sec per client for the
fileserver benchmark


First, we evaluate CA-NFS by running the
fileserver synthetic workload provided by
filebench on eight clients. This workload is modeled
after SPECsfs [39], an industry-standard test suite
that is based on data collected by SFS committee mem- 
bers from thousands of real NFS servers operating at 
customer sites. The test performs a sequence of creates, 
deletes, appends, reads, writes and attribute operations 
on the file system. We randomly set the number of user 
threads, the number of files written and the average 
file size to numbers between 100-200, 1000-5000 and 
100-5120KB respectively. This workload contains a 
large number of asynchronous operations. 

Figure 7 shows that CA-NFS outperforms NFS by 
more than 10% in the single client setup and by more 
than 20% in the eight-client setup. Figure 8 shows the
cumulative distribution function of the time that elapses
between a write operation submitted by the application and
the relevant pages marked for commit by the file system.
CA-NFS schedules asynchronous write operations very
differently from NFS. NFS clients are forced to commit
many pages almost immediately as they become dirty, in
order to prevent the system from running out of memory
to buffer dirty pages. No page stays dirty for more
than 12 seconds after the write is issued. CA-NFS schedules
the write-back operations more evenly across the
30-second time frame that defines the reliability window for
asynchronous writes in most current operating systems.
As a result, traffic in CA-NFS is less bursty, a significant
factor that improves performance.

Figure 8: CDF of the time that the system schedules
write-backs for NFS and CA-NFS

Figure 9: Aggregate client throughput for the oltp benchmark as a
function of the number of clients

                NFS            CA-NFS
ops/sec         2443           2503

ms/op
  open file     34.3           33.7
  read file     5.3            5.1
  close file    4.0            4.2
  append log    [illegible]    [illegible]

Table 1: NFS and CA-NFS under the webserver workload


The next benchmark examines CA-NFS characteris- 
tics under an OLTP workload that performs transactions 
into a filesystem using an I/O model from Oracle 9i. 
This workload tests for the performance of small random 
reads and writes in conjunction with moderate (128KB) 
synchronous writes. Operations represent read and write 
OLTP transactions and writes to the log file respectively. 
On each client, we launch 200 reader processes, 10 pro- 
cesses for asynchronous writing, and a log writer. We 
run the experiments four times, modifying the number of 
active clients. 

For the oltp workload, CA-NFS is more scalable 
than NFS. Although this workload exhibits some cache 
locality on the server, the main bottleneck in this exper- 
iment is the server’s disk, which is overwhelmed by the 
number of incoming requests. Figure 9 plots the aggre- 
gate client throughput for different client populations. 
For a small number of clients (one to four), CA-NFS 
provides a rather small performance advantage. As the 
number of clients increases, the relative throughput of 
CA-NFS increases when compared with NFS. In the case 
of NFS, the aggregate throughput for the 23-client setup 
is less than in the 12-client setup. This is because the 
number of incoming requests overwhelms the server, 
resulting in a throughput crash. 

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Figure 10: Number of asynchronous client writes blocked 

In our last experiment, we examine the performance 
of CA-NFS under a workload that contains mostly 
synchronous read operations. The file server exports its 
filesystem to a number of web servers. The webserver 
workload from the filebench suite consists of a mix of 
open/read/close of multiple files in a directory tree, plus 
a file append (to simulate the web log) in which 16KB is 
appended to the weblog for every 10 reads. 

CA-NFS performs slightly better (about 3%) than NFS 
thanks to the read-ahead optimizations. Write optimiza- 
tions are not a factor in this benchmark, because of the 
small number of write operations. Table 1 shows that 
for all operations the latency for both NFS and CA-NFS 
is almost identical. 

Macrobenchmark experiments show that CA-NFS sig- 
nificantly outperforms NFS under workloads with a sig- 
nificant number of asynchronous operations, such as 
the fileserver benchmark. For workloads that are 
read-dominated (webserver) or contain small, asyn- 
chronous requests (oltp), CA-NFS performs compa- 
rably to NFS, showing that our modifications are light- 
weight. 


4.3 High-Speed Hazards 


To further evaluate our framework, we perform mea- 
surements over a 10-Gbps Infiniband network. As op- 
posed to the previous set of experiments, in this setup, the 
network bandwidth outstrips disk transfer rates. We consider 
two clients writing file data sequentially to the 
server for fifteen seconds over NFS and CA-NFS. During 
the write burst, both clients write data at the maximum 
rate, close to 200MB/sec. 

This experiment shows that running out of memory 
turns asynchronous file system operations into synchronous 
ones that block all progress (Figure 10). Regular 
NFS experiences synchronous waits for asynchronous 
writes starting at 2 seconds. When the number of dirty 
pages on the NFS clients reaches the flushing point, 
clients start writing data, which overwhelms the disk system, 
and the memory available to buffer writes at the server 
fills. All subsequent writes block awaiting completion. 
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CA-NFS detects congestion on the server memory and 
I/O system through pricing and buffers writes in local 
memory. This makes the effective write-buffering space 
8GB, 2GB on each client and 4GB on the server, rather 
than the 4GB of server memory that NFS uses. CA-NFS 
does not experience synchronous waits until 8 seconds 
and blocks fewer writes overall. This results in higher 
overall throughput as well. CA-NFS writes 4.8GB worth 
of data whereas regular NFS writes only 3.9GB, an im- 
provement of 23%. 
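
The buffering and throughput figures quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers in the text; the variable names are illustrative and not part of CA-NFS.

```python
# Figures quoted in the text; names are illustrative only.
NUM_CLIENTS = 2
CLIENT_BUFFER_GB = 2
SERVER_BUFFER_GB = 4

# Write-buffering space available when clients also buffer locally.
effective_buffer_gb = NUM_CLIENTS * CLIENT_BUFFER_GB + SERVER_BUFFER_GB

# Data written during the fifteen-second burst.
ca_nfs_written_gb = 4.8
nfs_written_gb = 3.9
improvement_pct = round(100 * (ca_nfs_written_gb - nfs_written_gb) / nfs_written_gb)

print(effective_buffer_gb, improvement_pct)
```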


This scenario shows how the emergence of high-speed 
networking makes holistic storage management critical. 
For storage systems, we are on the verge of a new era. In- 
finiband and 10Gbps Ethernet deliver data at such rapid 
rates that storage systems that receive and process these 
data cannot keep up. This gap between network band- 
width and disk throughput creates a memory crisis for 
storage servers. Many clients writing data in parallel will 
create a data stream that a server cannot transfer to disk. 
Flow control in the transport protocol will be irrelevant, 
because the system is not network bound. The server 
buffers data pending I/O completion and the buffered 
data accumulate until memory is full. This results in a 
cascading throughput crash over the entire system [11]. 


5 Future Directions 


Although our focus is on the scheduling of asyn- 
chronous operations, pricing synchronous operations 
wisely can enable the system to manage nonstandard I/O 
processes. Distributed file systems often have lower- 
priority I/O tasks, such as data mining, indexing, backup, 
etc. Capping the willingness to pay for synchronous op- 
erations causes these low-priority tasks to halt automati- 
cally when resources become congested. Clients can also 
encode application priorities and differentiate between 
critical and noncritical tasks by charging different pro- 
cesses different prices. 
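
A minimal sketch of the price-cap idea, assuming a single scalar price per operation. The `Task` type, the `max_price` field, and the price units are hypothetical and are not taken from the CA-NFS implementation.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    max_price: float  # hypothetical cap on what the task pays per operation


def runnable_tasks(tasks, current_price):
    """Return the tasks whose cap still covers the current resource price."""
    return [t for t in tasks if t.max_price >= current_price]


tasks = [
    Task("interactive", max_price=float("inf")),  # critical: always pays
    Task("backup", max_price=0.2),                # low priority: capped
]

idle = [t.name for t in runnable_tasks(tasks, current_price=0.1)]
congested = [t.name for t in runnable_tasks(tasks, current_price=0.5)]
```

Under this sketch the capped task simply stops issuing operations once congestion pushes the price above its cap, and resumes when the price falls again.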


The proposed framework does not address the issue 
of fairness over time. Operation costs are proportional 
to the current state of the system but independent of the 
client that put the system into the state. For example, one 
client could fill the server cache with dirty data, pushing 
up prices for all others. 


Finally, more complex resource management goals 
can be realized by adding constraints to the auction 
model. For example, resource reservations can be ac- 
complished by differentially pricing the same resource 
among clients. The goal is to insulate one client from 
the consumption by non-reserving clients. To do so, 
we need to limit the spending of non-reserving clients 
and increase resource prices prior to exhaustion, creating 
an artificial shortage. Also, proportional sharing arises 
when clients are given salaries, i.e. a rate of consump- 
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tion or fixed amount of spending over some time interval. 
This concept extends the ideas of flow control beyond 
networks to cover all resources in the system. Pricing 
certain resources while making all others free allows 
sharing to be targeted to specific resources only. 


6 Related Work 


Economic Models: Using economic models for re- 
source management is not a novel approach [10]. 
Auction-based systems have been applied in a broad 
range of distributed systems including clusters [9], com- 
putational grids [26], parallel computers [40], and Inter- 
net computing systems [27]. These systems are intended 
for coarse-grained resource allocations. 


Network Flow Control: Flow control schemes offer 
each client a proportional share of the network and thus 
guarantee fairness to a large extent [31]. Many different 
approaches exist in the literature, including TCP-like 
window-based protocols [14, 19], feedback schemes 
[13], and optimization-based methods [15]. 

The congestion pricing techniques upon which we 
build have been used by Amir et al. [2] to manage a 
single network resource. Kelly [23] was the first to de- 
scribe pricing for flow and congestion control. However, 
our approach and Amir’s are algorithmic, whereas Kelly 
relies on economic theory. 


Memory Management: Li et al. [28] acknowledge the 
asynchronous nature of writes and their dependence on 
the client’s state. They propose a scheme where the stor- 
age clients inform the storage servers about the types of 
writes that they perform by passing write hints. These 
write hints can then be used by the server to manage the 
second-tier cache. 

Carson and Setia [7] showed that for many workloads, 
periodic updates from a write-back cache perform worse 
than write-through caching. They suggest two alternate 
disciplines: (1) giving reads non-preemptive priority and 
(2) interval periodic writes in which each write gets its 
own fixed period in the cache. Mogul [32] implements 
an approximate interval periodic write-back policy that 
staggers writes in time using a small (one second) timer. 
Golding et al. [16] delay write-back until the system 
reaches an idle period. This reduces the delays seen by 
reads by postponing competing writes until idle periods, 
possibly with the help of nonvolatile memory, in order to 
ensure consistency. 

Storage controllers with nonvolatile memory employ 
adaptive destaging policies that vary the rate of writing 
[1, 43] or the destage threshold [33, 43], based on 
memory occupancy and filling and draining rates. In these 
systems, cached writes are persistent, so they want to de- 
lay destaging data as long as possible. 
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Patterson et al. [35] in TIP made cache residency and 
prefetching decisions over the network following a cost-benefit 
analysis. Their work was based on models that 
value memory pages for different types of data, such as 
prefetched, buffered, or cached. Nelson et al. [34] in 
Sprite mentioned a weight used to trade off how to partition 
memory between pages for the file cache and for virtual 
memory. Nelson’s principle was not applied to a distributed 
context. These approaches use heuristic methods 
and do not look at the relative load across all clients. 


Storage Scheduling and QoS: Storage quality of ser- 
vice (QoS) attempts to optimize the system resources 
individually [17, 29] or conjunctively [22]. Fairness in the 
QoS context is generalized to incorporate weights used 
to introduce deliberate bias, depending for example on 
different service-level agreements (SLAs) [45]. 

In general, quality of service (QoS) approaches are not 
well-suited for multi-resource optimization. CA-NFS 
complements QoS methods [8, 22, 30, 42] that employ 
I/O throttling in order to limit resource congestion and 
avoid throughput crashes. We do not offer the perfor- 
mance guarantees to applications on which one might 
build SLAs [29]. Instead, we follow a best-effort ap- 
proach to improve application-perceived performance by 
minimizing latency and maximizing throughput for syn- 
chronous file system operations. 


Provisioning: Provisioning systems use a single metric, 
utility or cost in dollars, to unify heterogeneous resources 
when deciding the initial configuration of a system under 
a fixed utility budget. Recently, Strunk et al. [41] provided 
a framework for provisioning based on detailed system 
models and genetic algorithms to explore the configuration 
space. This extends the previous work on provisioning 
by Anderson et al. [3]. 

While the unification of resources using utility is 
superficially similar to pricing, provisioning solves a 
very different problem. Provisioning determines how to 
achieve the best availability, throughput, or IOPS under 
a fixed budget as a static offline configuration problem. 
CA-NFS examines dynamic pricing of operations under 
changing workloads in static configurations. 


7 Conclusions 


We have shown the importance of using holistic per- 
formance management for the adaptive scheduling of 
lower-priority distributed file system requests based on 
system congestion in order to reduce their interference 
with foreground, synchronous requests. We also show 
the virtue of adaptation based on application-perceived 
performance, rather than server-centric metrics. 

CA-NFS introduces a new dimension in resource management 
by implicitly managing and coordinating the usage 
of the file system resources among all clients. It 
unifies fairness and priorities in a single framework that 
assures that realizing optimization goals will benefit file 
system users, not the file system servers. 
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Abstract 


We present sparse indexing, a technique that uses sam- 
pling and exploits the inherent locality within backup 
streams to solve for large-scale backup (e.g., hundreds of 
terabytes) the chunk-lookup disk bottleneck problem that 
inline, chunk-based deduplication schemes face. The 
problem is that these schemes traditionally require a full 
chunk index, which indexes every chunk, in order to de- 
termine which chunks have already been stored; unfortu- 
nately, at scale it is impractical to keep such an index in 
RAM and a disk-based index with one seek per incoming 
chunk is far too slow. 

We perform stream deduplication by breaking up an 
incoming stream into relatively large segments and dedu- 
plicating each segment against only a few of the most 
similar previous segments. To identify similar segments, 
we use sampling and a sparse index. We choose a small 
portion of the chunks in the stream as samples; our sparse 
index maps these samples to the existing segments in 
which they occur. Thus, we avoid the need for a full 
chunk index. Since only the sampled chunks’ hashes are 
kept in RAM and the sampling rate is low, we dramat- 
ically reduce the RAM to disk ratio for effective dedu- 
plication. At the same time, only a few seeks are re- 
quired per segment so the chunk-lookup disk bottleneck 
is avoided. Sparse indexing has recently been incorporated 
into a number of Hewlett-Packard backup products. 


1 Introduction 


Traditionally, magnetic tape has been used for data backup. 
With the explosion in disk capacity, it is now affordable 
to use disk for data backup. Disk, unlike tape, 
is random access and can significantly speed up backup 
and restore operations. Accordingly, disk-to-disk backup 
(D2D) has become the preferred backup option for orga- 
nizations [3]. 

Deduplication can increase the effective capacity of 
a D2D device by one or two orders of magnitude [4]. 
Deduplication can accomplish this because backup sets 
have massive redundancy due to the facts that a large pro- 
portion of data does not change between backup sessions 
and that files are often shared between machines. Dedu- 
plication, which is practical only with random-access de- 
vices, removes this redundancy by storing duplicate data 
only once and has become an essential feature of disk- 
to-disk backup solutions. 

We believe chunk-based deduplication is the dedu- 
plication method best suited to D2D: it deduplicates 
data both across backups and within backups and does 
not require any knowledge of the backup data format. 
With this method, data to be deduplicated is broken 
into variable-length chunks using content-based chunk 
boundaries [20], and incoming chunks are compared 
with the chunks in the store by hash comparison; only 
chunks that are not already there are stored. We are inter- 
ested in inline deduplication, where data is deduplicated 
as it arrives rather than later in batch mode, because of 
its capacity, bandwidth, and simplicity advantages (see 
Section 2.2). 

Unfortunately, inline, chunk-based deduplication 
when used at large scale faces what is known as the 
chunk-lookup disk bottleneck problem: Traditionally, this 
method requires a full chunk index, which maps each 
chunk’s hash to where that chunk is stored on disk, in or- 
der to determine which chunks have already been stored. 
However, at useful D2D scales (e.g., 10-100 TB), it is 
impractical to keep such a large index in RAM and a 
disk-based index with one seek per incoming chunk is 
far too slow (see Section 2.3). 

This problem has been addressed in the literature by 
Zhu et al. [28], who tackle it by using an in-memory 
Bloom filter and caching index fragments, where each 
fragment indexes a set of chunks found together in the 
input. In this paper, we show a different way of solving 
this problem in the context of data stream deduplication 
(the D2D case). Our solution has the advantage that it 
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uses significantly less RAM than Zhu et al.’s approach. 

To solve the chunk-lookup disk bottleneck problem, 
we rely on chunk locality: the tendency for chunks in 
backup data streams to reoccur together. That is, if the 
last time we encountered chunk A, it was surrounded by 
chunks B, C, and D, then the next time we encounter A 
(even in a different backup) it is likely that we will also 
encounter B, C, or D nearby. This differs from traditional 
notions of locality because occurrences of A may be sep- 
arated by very long intervals (e.g., terabytes). A derived 
property we take advantage of is that if two pieces of 
backup streams share any chunks, they are likely to share 
many chunks. 

We perform stream deduplication by breaking up each 
input stream into segments, each of which contains thou- 
sands of chunks. For each segment, we choose a few 
of the most similar segments that have been stored previ- 
ously. We deduplicate each segment against only its cho- 
sen few segments, thus avoiding the need for a full chunk 
index. Because of the high chunk locality of backup 
streams, this still provides highly effective deduplication. 

To identify similar segments, we use sampling and a 
sparse index. We choose a small portion of the chunks 
as samples; our sparse index maps these samples’ hashes 
to the already-stored segments in which they occur. By 
using an appropriate low sampling rate, we can ensure 
that the sparse index is small enough to fit easily into 
RAM while still obtaining excellent deduplication. At 
the same time, only a few seeks are required per segment 
to load its chosen segments’ information avoiding any 
disk bottleneck and achieving good throughput. 

Of course, since we deduplicate each segment against 
only a limited number of other segments, we occasion- 
ally store duplicate chunks. However, due to our lower 
RAM requirements, we can afford to use smaller chunks, 
which more than compensates for the loss of dedupli- 
cation the occasional duplicate chunk causes. The ap- 
proach described in this paper has recently been incorpo- 
rated into a number of Hewlett-Packard backup products. 

The rest of this paper is organized as follows: in the 
next section, we provide more background. In Section 3, 
we describe our approach to doing chunk-based dedu- 
plication. In Section 4, we report on various simula- 
tion experiments with real data, including a comparison 
with Zhu et al., and on the ongoing productization of this 
work. Finally, we describe related work in Section 5 and 
our conclusions in Section 6. 


2 Background 


2.1 D2D usage 


There are two modes in which D2D is performed today, 
using a network-attached-storage (NAS) protocol and us- 
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ing a Virtual Tape Library (VTL) protocol: 

In the NAS approach, the backup device is treated as a 
network-attached storage device, and files are copied to 
it using protocols such as NFS and CIFS. To achieve high 
throughput, typically large directory trees are coalesced, 
using a utility such as tar, and the resulting tar file stored 
on the backup device. Note that tar can operate either in 
incremental or in full mode. 

The VTL approach is for backward compatibility with 
existing backup agents. There is a large installed base of 
thousands of backup agents that send their data to tape 
libraries using a standard tape library protocol. To make 
the job of migrating to disk-based backup easier, ven- 
dors provide Virtual Tape Libraries: backup storage de- 
vices that emulate the tape library protocol for I/O, but 
use disk-based storage internally. 

In both NAS and VTL-based D2D, the backup data is 
presented to the backup storage device as a stream. In 
the case of VTL, the stream is the virtual tape image, 
and in the case of NAS-based backup, the stream is the 
large tar file that is generated by the client. In both cases, 
the stream can be quite large: a single tape image can be 
400 GB, for example. 


2.2 Inline versus out-of-line deduplication 


Inline deduplication refers to deduplication processes 
where the data is deduplicated as it arrives and before 
it hits disk, as opposed to out-of-line (also called post- 
process) deduplication where the data is first accumu- 
lated in an on-disk holding area and then deduplicated 
later in batch mode. With out-of-line deduplication, the 
chunk-lookup disk bottleneck can be avoided by using 
batch processing algorithms, such as hash join [24], to 
find chunks with identical hashes. 

However, out-of-line deduplication has several disad- 
vantages compared to inline deduplication: (a) the need 
for an on-disk holding area large enough to hold an en- 
tire backup window’s worth of raw data can substantially 
diminish storage capacity, (b) all the functionality that a 
D2D device provides (data restoration, data replication, 
compression, etc.) must be implemented and/or tested 
separately for the raw holding area as well as the dedupli- 
cated store, and (c) it is not possible to conserve network 
or disk bandwidth because every chunk must be written 
to the holding area on disk. 


2.3 The chunk-lookup disk bottleneck 


The traditional way to implement inline, chunk-based 
deduplication is to use a full chunk index: a key-value 
index of all the stored chunks, where the key is a chunk’s 
hash, and the value holds metadata about that chunk, in- 
cluding where it is stored on disk [22, 14]. When an 
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incoming chunk is to be stored, its hash is looked up in 
the full index, and the chunk is stored only if no entry is 
found for its hash. We refer to this approach as the full 
index approach. 

Using a small chunk size is crucial for high-quality 
chunk-based deduplication because most duplicate data 
regions are not particularly large. For example, for our 
data set Workgroup (see Section 4.2), switching from 4 
to 8 KB average-size chunks reduces the deduplication 
factor (original size/deduplicated size) from 13 to 11; 
switching to 16 KB chunks further reduces it to 9. 

This need for a small chunk size means that the full 
chunk index consumes a great deal of space for large 
stores. Consider, for example, a store that contains 10 TB 
of unique data and uses 4 KB chunks. Then there are 
2.7 × 10^9 unique chunks. Assuming that every hash entry 
in the index consumes 40 bytes, we need 100 GB of 
storage for the full index. 
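
The index-size estimate above can be reproduced directly. The sketch assumes binary units (1 TB = 2^40 bytes, 1 KB = 2^10 bytes), which is an assumption about the paper's convention; with it the result is exactly 100 GB.

```python
KB, GB, TB = 2**10, 2**30, 2**40  # binary units (assumed convention)

unique_data_bytes = 10 * TB
chunk_size_bytes = 4 * KB
index_entry_bytes = 40

# One index entry per unique chunk.
num_chunks = unique_data_bytes // chunk_size_bytes
full_index_gb = num_chunks * index_entry_bytes / GB

print(f"{num_chunks:.2e} chunks, {full_index_gb:.0f} GB index")
```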

It is not cost effective to keep all of this index in 
RAM. However, if we keep the index on disk, due to 
the lack of short-term locality in the stream of incoming 
chunk hashes, we will need one disk seek per chunk hash 
lookup. If a seek on average takes 4 ms, this means we 
can look up only 250 chunks per second for a process- 
ing rate of 1 MB/s, which is not acceptable. This is the 
chunk-lookup disk bottleneck that needs to be avoided. 
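
The throughput ceiling follows from the seek time alone. A short check of the arithmetic, using the 4 ms seek and 4 KB average chunk size from the text (integer milliseconds are used to avoid rounding noise):

```python
SEEK_MS = 4          # average seek time from the text
CHUNK_SIZE_KB = 4    # average chunk size from the text

# One seek per chunk-hash lookup bounds the lookup rate.
lookups_per_second = 1000 // SEEK_MS
throughput_kb_per_second = lookups_per_second * CHUNK_SIZE_KB

print(lookups_per_second, "lookups/s,", throughput_kb_per_second, "KB/s")
```

The result, 1000 KB/s, is the roughly 1 MB/s processing rate quoted above.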


3 Our Approach 


Under the sparse indexing approach, segments are the 
unit of storage and retrieval. A segment is a sequence 
of chunks. Data streams are broken into segments in a 
two-step process: first, the data stream is broken into 
a sequence of variable-length chunks using a chunking 
algorithm, and, second, the resulting chunk sequence is 
broken into a sequence of segments using a segmenting 
algorithm. Segments are usually on the order of a few 
megabytes. We say that two segments are similar if they 
share a number of chunks. 

Segments are represented in the store using their mani- 
fests: a manifest or segment recipe [25] is a data structure 
that allows reconstructing a segment from its chunks, 
which are stored separately in one or more chunk con- 
tainers to allow for sharing of chunks between segments. 
A segment’s manifest records its sequence of chunks, 
giving for each chunk its hash, where it is stored on disk, 
and possibly its length. Every stored segment has a man- 
ifest that is stored on disk. 

Incoming segments are deduplicated against similar, 
existing segments in the store. Deduplication proceeds 
in two steps: first, we identify among all the segments in 
the store some that are most similar to the incoming segment, 
which we call champions, and, second, we deduplicate 
against those segments by finding the chunks they 
share with the incoming segment, which do not need to 
be stored again. 

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Figure 1: Block diagram of the deduplication process 

To identify similar segments, we sample the chunk 
hashes of the incoming segment, and use an in-RAM in- 
dex to determine which already-stored segments contain 
how many of those hashes. A simple and fast way to 
sample is to choose as a sample every hash whose first n 
bits are zero; this results in an average sampling rate of 
1/2^n; that is, on average one in 2^n hashes is chosen as a 
sample. We call the chosen hashes hooks. 
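
The sampling rule can be sketched as follows. The function names `is_hook` and `sample_hooks` are illustrative, and SHA-1 over small integers merely stands in for real chunk hashes.

```python
import hashlib


def is_hook(chunk_hash: bytes, n: int) -> bool:
    """Return True if the first n bits of the hash are all zero."""
    total_bits = len(chunk_hash) * 8
    if not 0 <= n <= total_bits:
        raise ValueError("n must be between 0 and the hash length in bits")
    return int.from_bytes(chunk_hash, "big") >> (total_bits - n) == 0


def sample_hooks(chunk_hashes, n):
    """Keep only the hashes selected as hooks."""
    return [h for h in chunk_hashes if is_hook(h, n)]


# Stand-in hashes; with n = 7 roughly one hash in 128 should be sampled.
hashes = [hashlib.sha1(str(i).encode()).digest() for i in range(20000)]
hooks = sample_hooks(hashes, n=7)
```

Because the decision depends only on the hash value, the same chunk is always sampled (or not) wherever it reappears, which is what lets hooks identify similar segments.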

The in-memory index, called the sparse index, maps 
hooks to the manifests in which they occur. The mani- 
fests themselves are kept on disk; the sparse index holds 
only pointers to them. Once we have chosen cham- 
pions, we can load their manifests into RAM and use 
them to deduplicate the incoming segment. Note that al- 
though we choose champions because they share hooks 
with the incoming segment (and thus, the chunks with 
those hashes), as a consequence of chunk locality they 
are likely to share many other chunks with the incoming 
segment as well. 

We will now describe the deduplication process in 
more detail. A block diagram of the process can be found 
in Figure 1. 


3.1 Chunking and segmenting 


Content-based chunking has been studied at length in the 
literature [1, 16, 20]. We use our Two-Threshold Two- 
Divisor (TTTD) chunking algorithm [13] to subdivide 
the incoming data stream into chunks. TTTD produces 
variable-sized chunks with smaller size variation than 
other chunking algorithms, leading to superior dedupli- 
cation. 

We consider two different segmentation algorithms in 
this paper, each of which takes a target segment size as 
a parameter. The first algorithm, fixed-size segmentation, 
chops the stream of incoming chunks just before the first 
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chunk whose inclusion would make the current segment 
longer than the goal segment length. “Fixed-sized” seg- 
ments thus actually have a small amount of size varia- 
tion because we round down to the nearest chunk bound- 
ary. We believe that it is important to make segment 
boundaries coincide with chunk boundaries to avoid split 
chunks, which have no chance of being deduplicated. 
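
A minimal sketch of fixed-size segmentation as described, operating on chunk lengths only. The function name and the list-of-lengths representation are assumptions made for illustration.

```python
def fixed_size_segments(chunk_lengths, target_bytes):
    """Cut just before the first chunk that would exceed the target size."""
    segments, current, current_bytes = [], [], 0
    for length in chunk_lengths:
        # Close the segment if adding this chunk would overshoot the target.
        if current and current_bytes + length > target_bytes:
            segments.append(current)
            current, current_bytes = [], 0
        current.append(length)
        current_bytes += length
    if current:
        segments.append(current)
    return segments


segments = fixed_size_segments([4, 4, 4, 4, 4], target_bytes=10)
```

In this sketch a single chunk larger than the target still forms its own segment, so chunks are never split.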

Because we perform deduplication by finding seg- 
ments similar to an incoming segment and deduplicating 
against them, it is important that the similarity between 
an incoming segment and the most similar existing seg- 
ments in the store be as high as possible. Fixed-size seg- 
mentation does not perform as well here as we would 
like because of the boundary-shifting problem [13]: Con- 
sider, for example, two data streams that are identical ex- 
cept that the first stream has an extra half-a-segment size 
worth of data at the front. With fixed-size segmentation, 
segments in the second stream will only have 50% over- 
lap with the segments in the first stream, even though the 
two streams are identical except for some data at the start 
of the first stream. 

To avoid the segment boundary-shifting problem, 
our second segmentation algorithm, variable-size seg- 
mentation, uses the same trick used at the chunking 
level to avoid the boundary-shifting problem: we base 
the boundaries on landmarks in the content, not dis- 
tance. Variable-size segmentation operates at the level 
of chunks (really chunk hashes) rather than bytes and 
places segment boundaries only at existing chunk bound- 
aries. The start of a chunk is considered to represent a 
landmark if that chunk’s hash modulo a predetermined 
divisor is equal to -1. The frequency of landmarks—and 
hence average segment size—can be controlled by vary- 
ing the size of the divisor. 
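
The landmark test can be sketched as below. Two assumptions are made: "equal to -1" is read in the modular sense, that is, a remainder of divisor - 1, and the minimum and maximum thresholds that TTTD adds are omitted for brevity. Function names are illustrative.

```python
def is_landmark(chunk_hash: bytes, divisor: int) -> bool:
    """A chunk starts a landmark if its hash is congruent to -1 mod divisor."""
    return int.from_bytes(chunk_hash, "big") % divisor == divisor - 1


def landmark_segments(chunk_hashes, divisor):
    """Place a segment boundary just before every landmark chunk."""
    segments, current = [], []
    for h in chunk_hashes:
        if current and is_landmark(h, divisor):
            segments.append(current)
            current = []
        current.append(h)
    if current:
        segments.append(current)
    return segments


# Toy one-byte "hashes" with values 1, 3, 2, 7, 5; with divisor 4,
# the values 3 and 7 are landmarks.
toy = [b"\x01", b"\x03", b"\x02", b"\x07", b"\x05"]
toy_segments = landmark_segments(toy, divisor=4)
```

Because boundaries depend on content rather than position, inserting data at the front of a stream shifts at most the first segment and leaves later boundaries where they were.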

To reduce segment-size variation, variable-size seg- 
mentation uses TTTD applied to chunks instead of data 
bytes. The algorithm is the same, except that we move 
one chunk at a time instead of one byte at a time, and 
that we use the above notion of what a landmark is. Note 
that this ignores the lengths of the chunks, treating long 
and short chunks the same. We obtain the needed TTTD 
parameters (minimum size, maximum size, primary di- 
visor, and secondary divisor) in the usual way from the 
desired average size. Thus, for example, with variable- 
size segmentation, mean size 10 MB segments using 4 
KB chunks have from 1,160 to 7,062 chunks with an av- 
erage of 2,560 chunks, each chunk of which, on average, 
contains 4 KB of data. 


3.2 Choosing champions 


Looking up the hooks of an incoming segment S in the 
sparse index results in a possible set of manifests against 
which that segment can be deduplicated. However, we 
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do not necessarily want to use all of those manifests to 
deduplicate against, since loading manifests from disk is 
costly. In fact, as we show in Section 4.3, only a few well 
chosen manifests suffice. So, from among all the mani- 
fests produced by querying the sparse index, we choose a 
few to deduplicate against. We call the chosen manifests 
champions. 

The algorithm by which we choose champions is as 
follows: we choose champions one at a time until the 
maximum allowable number of champions are found, or 
we run out of candidate manifests. Each time we choose, 
we choose the manifest with the highest non-zero score, 
where a manifest gets one point for each hook it has in 
common with S that is not already present in any previously 
chosen champion. If there is a tie, we choose the 
manifest most recently stored. The choice of which man- 
ifests to choose as champions is done based solely on the 
hooks in the sparse index; that is, it does not involve any 
disk accesses. 

We don’t give points for hooks belonging to already 
chosen manifests because those chunks (and the chunks 
around them by chunk locality) are most likely already 
covered by the previous champions. Consider the fol- 
lowing highly-simplified example showing the hooks of 
S and three candidate manifests (m1 to m3): 

    S    b  c  d  e  m  n 

    m1   a  b* c* d* e* f 
    m2   z  a  b* c* d* f 
    m3   m* n* o  p  q  r 

The manifests are shown in descending order of how 
many hooks they have in common with S (common 
hooks marked with an asterisk). Our algorithm chooses m1, then 
m3, which together cover all the hooks of S, unlike m1 
and m2. 
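
The greedy selection can be sketched as follows, using hook sets modeled on the example above. The function name, the `(manifest_id, hooks)` representation, and the convention that candidates are listed from oldest to newest (so that ties go to the later entry) are assumptions made for illustration.

```python
def choose_champions(segment_hooks, candidates, max_champions):
    """Greedily pick manifests that cover the most not-yet-covered hooks.

    candidates: list of (manifest_id, hooks), ordered oldest to newest.
    """
    uncovered = set(segment_hooks)
    pool = [(manifest_id, set(hooks)) for manifest_id, hooks in candidates]
    champions = []
    while pool and len(champions) < max_champions:
        best_index, best_score = None, 0
        for i, (_, hooks) in enumerate(pool):
            score = len(hooks & uncovered)
            # ">=" lets the more recently stored manifest win a tie.
            if score > 0 and score >= best_score:
                best_index, best_score = i, score
        if best_index is None:
            break  # no remaining candidate has a non-zero score
        manifest_id, hooks = pool.pop(best_index)
        champions.append(manifest_id)
        uncovered -= hooks
    return champions


S = "bcdemn"
candidates = [("m1", "abcdef"), ("m2", "zabcdf"), ("m3", "mnopqr")]
champions = choose_champions(S, candidates, max_champions=2)
```

After m1 is chosen, m2 scores zero because its shared hooks are already covered, so m3 is picked next even though it shares fewer hooks with S overall.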


3.3 Deduplicating against the champions 


Once we have determined the champions for the incom- 
ing segment, we load their manifests from disk. A small 
cache of recently loaded manifests can speed this process 
up somewhat because adjacent segments sometimes have 
champions in common. 

The hashes of the chunks in the incoming segment 
are then compared with the hashes in the champions’ 
manifests in order to find duplicate chunks. We use the 
SHA-1 hash algorithm [15] to make false positives here 
extremely unlikely. Those chunks that are found not to 
be present in any of the champions are stored on disk in 
chunk containers, and a new manifest is created for the 
incoming segment. The new manifest contains the loca- 
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tion on disk where each incoming chunk is stored. In 
the case of chunks that are duplicates of a chunk in one 
or more of the champions, the location is the location of 
the existing chunk, which is obtained from the relevant 
manifest. In the case of new chunks, the on-disk location 
is where that chunk has just been stored. Once the new 
manifest is created, it is stored on disk in the manifest 
store. 
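
A minimal sketch of this step, assuming a manifest is a list of (hash, location) pairs and that a caller-supplied `store_chunk` function writes a chunk and returns its location. Both the representation and the names are illustrative.

```python
def deduplicate_segment(segment, champion_manifests, store_chunk):
    """Build the manifest for an incoming segment.

    segment: list of (chunk_hash, chunk_data) pairs, in stream order.
    champion_manifests: list of manifests, each a list of (hash, location).
    store_chunk: callable that stores chunk data and returns its location.
    """
    known = {}
    for manifest in champion_manifests:
        for chunk_hash, location in manifest:
            known.setdefault(chunk_hash, location)

    new_manifest = []
    for chunk_hash, data in segment:
        if chunk_hash not in known:
            # New chunk: store it and remember where it went, so a repeat
            # later in the same segment is not stored twice.
            known[chunk_hash] = store_chunk(data)
        new_manifest.append((chunk_hash, known[chunk_hash]))
    return new_manifest


stored = []

def store_chunk(data):
    stored.append(data)
    return len(stored) - 1  # toy "location": position in a list

manifest = deduplicate_segment(
    [("h1", b"a"), ("h2", b"b"), ("h2", b"b")],
    champion_manifests=[[("h1", "loc-1")]],
    store_chunk=store_chunk,
)
```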


Finally, we add entries for this manifest to the sparse 
index with the manifest’s hooks as keys. Some of the 
hooks may already exist in the sparse index, in which 
case we add the manifest to the list of manifests that are 
pointed to by that hook. To conserve space, it may be 
desirable to set a maximum limit for the number of man- 
ifests that can be pointed to by any one hook. If the max- 
imum is reached, the oldest manifest is removed from the 
list before the newest one is added. 
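
One way to sketch the sparse index with a per-hook cap is a dictionary of bounded queues. The class and method names are hypothetical; `collections.deque` with `maxlen` discards the oldest entry automatically when a new one is appended to a full queue.

```python
from collections import defaultdict, deque


class SparseIndex:
    """Maps each hook to the most recent manifests that contain it."""

    def __init__(self, max_manifests_per_hook):
        self._entries = defaultdict(
            lambda: deque(maxlen=max_manifests_per_hook)
        )

    def add_manifest(self, manifest_id, hooks):
        for hook in hooks:
            # When the deque is full, the oldest manifest is dropped.
            self._entries[hook].append(manifest_id)

    def lookup(self, hooks):
        """Return {manifest_id: set of the given hooks it contains}."""
        found = defaultdict(set)
        for hook in hooks:
            for manifest_id in self._entries.get(hook, ()):
                found[manifest_id].add(hook)
        return dict(found)


index = SparseIndex(max_manifests_per_hook=2)
index.add_manifest("m1", {"a", "b"})
index.add_manifest("m2", {"a"})
index.add_manifest("m3", {"a"})  # evicts m1 from hook "a"
result = index.lookup({"a", "b"})
```

The result of `lookup` is the candidate set from which champions are chosen, and it is computed entirely in RAM.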


3.4 Avoiding the chunk-lookup disk bottleneck 


Notice that there is no full chunk index in our approach, 
either in RAM or on disk. The only index we maintain 
in RAM, the sparse index, is much smaller than a full 
chunk index: for example, if we only sample one out of 
every 128 hashes, then the sparse index can be 128 times 
smaller than a full chunk index. 

We do have to make a handful of random disk accesses 
per segment in order to load in champion manifests, but 
the cost of those seeks is amortized over the thousands of 
chunks in each segment, leading to acceptable through- 
put. Thus, we avoid the chunk-lookup disk bottleneck. 


3.5 Storing chunks 


We do not have room in this paper, alas, to describe how 
best to store chunks in chunk containers. The scheme de- 
scribed in Zhu et al. [28], however, is a pretty good start- 
ing point and can be used with our approach. They main- 
tain an open chunk container for each incoming stream, 
appending each new (unique) chunk to the open con- 
tainer corresponding to the stream it is part of. When a 
chunk container fills up (they use a fixed size for efficient 
packing), a new one is opened up. 

This process uses chunk locality to group together 
chunks likely to be retrieved together so that restoration 
performance is reasonable. Supporting deletion of seg- 
ments requires additional machinery for merging mostly 
empty containers, garbage collection (when is it safe to 
stop storing a shared chunk?), and possibly defragmen- 
tation. 


3.6 Using less bandwidth 


We have described a system where all the raw backup 
data is fed across the network to the backup system and 
only then deduplicated, which may consume a lot of 
network bandwidth. It is possible to use substantially
less bandwidth at the cost of some client-side processing
if the legacy backup clients can be modified or
replaced. One way of doing this is to have the backup
client perform the chunking, hashing, and segmenta- 
tion locally. The client initially sends only a segment’s 
chunks’ hashes to the back-end, which performs cham- 
pion choosing, loads the champion manifests, and then 
determines which of those chunks need to be stored. The 
back-end notifies the client of this and the client sends 
only the chunks that need to be stored, possibly com- 
pressed. 
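
The two-round exchange can be sketched as below. This is a toy model under stated assumptions: the `BackEnd` class and `backup_segment` function are hypothetical, and a dictionary of everything stored stands in for champion choosing and manifest loading, so the toy back-end never requests a chunk it already holds, whereas a real back-end would occasionally do so.

```python
import hashlib
from typing import Dict, List, Set


class BackEnd:
    """Toy back-end; `stored` stands in for champion manifests."""

    def __init__(self) -> None:
        self.stored: Dict[bytes, bytes] = {}

    def needed(self, segment_hashes: List[bytes]) -> Set[bytes]:
        """Round 1: given only hashes, report which chunks must be sent."""
        return {h for h in segment_hashes if h not in self.stored}

    def receive(self, chunks: Dict[bytes, bytes]) -> None:
        """Round 2: accept the requested chunk data."""
        self.stored.update(chunks)


def backup_segment(chunks: List[bytes], back_end: BackEnd) -> int:
    """Back up one segment; return the chunk payload bytes transmitted."""
    hashes = [hashlib.sha1(c).digest() for c in chunks]
    wanted = back_end.needed(hashes)
    # Keying by hash sends each needed chunk once, even if it repeats.
    payload = {h: c for h, c in zip(hashes, chunks) if h in wanted}
    back_end.receive(payload)
    return sum(len(c) for c in payload.values())
```

The hash traffic itself (20 bytes per chunk with SHA-1) is small relative to 4 KB chunks, which is where the bandwidth saving comes from.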


4 Experimental Results 


In order to test our approach, we built a simulator that 
allows us to experiment with a number of important pa- 
rameters, including some parameter values that are infea- 
sible in practice (e.g., using a full index). We apply our 
simulator to two realistic data sets and report below on 
locality, overall deduplication, RAM usage, and through- 
put. We also report briefly on some optimizations and an 
ongoing productization that validates our approach. 


4.1 Simulator 


Our simulator takes as input a series of (chunk hash, 
length) pairs, divides it into segments, determines the 
champions for each segment, and then calculates the 
amount of deduplication obtained. Available knobs in- 
clude type of segmentation (fixed or variable size), mean 
segment size, sampling rate, maximum number of cham- 
pions loadable per segment, how many manifest IDs to 
keep per hook in the sparse index, and whether or not to 
use a simple form of manifest caching (see Section 4.7). 

We (or others when privacy is a concern) run a small 
tool we have written called chunklite in order to produce 
chunk information for the simulator. Chunklite reads 
from either a mounted tape or a list of files, chunking 
using the TTTD chunking algorithm [13]. Except where
we say otherwise, all experiments use chunklite’s default
4 KB mean chunk size (see Note 2), which we find a good
trade-off between maximizing deduplication and minimizing
per-chunk overhead.

The simulator produces various statistics, including 
the sum of lengths of every input chunk (original size) 
and the sum of lengths of every non-removed chunk 
(deduplicated size). The estimated deduplication factor 
is then original size/deduplicated size. 
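
As an illustration of this statistic, the sketch below computes the deduplication factor for the idealized full-index case, in which every duplicate chunk is removed. The function name is hypothetical, and the simulator itself removes only the duplicates found via the chosen champions.

```python
from typing import Iterable, Tuple


def deduplication_factor(chunks: Iterable[Tuple[bytes, int]]) -> float:
    """Return original size / deduplicated size for (hash, length) pairs,
    assuming perfect deduplication (every repeated hash is removed)."""
    original = 0
    deduplicated = 0
    seen = set()
    for digest, length in chunks:
        original += length
        if digest not in seen:
            # First occurrence of this chunk: it is kept.
            seen.add(digest)
            deduplicated += length
    if deduplicated == 0:
        raise ValueError("no chunk data supplied")
    return original / deduplicated
```

For example, three 4-byte chunks of which two are identical give a factor of 12/8 = 1.5.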

4.2 Data sets


We report results for two data sets. The first data set, 
which we call Workgroup, is composed of a semi-regular 
series of backups of the desktop PCs of a group of 20 
engineers taken over a period of three months. Although 
the original collection included only an initial full and 
later weekday incrementals for each machine, we have 
generated synthetic fulls at the end of each week for 
which incrementals are available by applying that week’s 
incrementals to the last full. The synthetic fulls replace 
the last incremental for their week; in this way, we sim- 
ulate a more typical daily incremental and weekly full 
backup schedule. We are unable to simulate file deletions 
because this information is missing from the collection. 


Altogether, there are 154 fulls and 392 incrementals 
in this 3.8 TB data set, which consists of each of these 
backup snapshots tar’ed up without compression in the 
order they were taken. We believe this data set is repre- 
sentative of a small corporate workgroup being backed 
up via tar directly to a NAS interface. Note that because 
these machines are only powered up during workdays 
and because the synthetic fulls replace the last day of the 
week’s backup, the ratio of incrementals to fulls (2.5) is
lower than would be the case for a server (6 or 7). 

The second data set, which we call SMB, is intended, 
by contrast, to be representative of a small or medium 
business server backed up to virtual tape. It contains 
two weeks (3 fulls, 12 incrementals) of Oracle & Ex- 
change data backed up via Symantec’s NetBackup to vir- 
tual tape. The Exchange data was synthetic data gener- 
ated by the Microsoft Exchange Server 2003 Load Simu- 
lator (LoadSim) tool [19], while the Oracle data was cre- 
ated by inserting rows from a real 1+ TB Oracle database 
belonging to a compliance test group combined with 
a small number of random deletes and updates. This 
data set occupies 0.6 TB and has less duplication than 
one might expect because Exchange already uses sin- 
gle instance storage (each message is stored only once 
no matter how many users receive it) and because Net- 
Backup does true Exchange incrementals, saving only 
new/changed messages. 

We have chosen data sets with daily incrementals and 
weekly fulls rather than just daily fulls because such data 
sets are harder to deduplicate well, and thus provide a 
better test of any deduplication system. Incrementals are 
harder to deduplicate because they contain less duplicate 
material and because they have less locality: any given 
incremental segment likely contains files from many seg- 
ments of the previous full whereas a full segment may 
only contain files from one or two segments of the
previous full. Series consisting only of fulls do, however,
generate higher deduplication factors, which are beloved
of marketing departments everywhere.

Figure 2: Conservative estimate (GREEDY) of deduplication
effectiveness obtainable by deduplicating each segment
against up to M prior segments for data set Workgroup.
Shown are results for 5 different segment sizes, with all
segments chosen via variable-size segmentation.





4.3 Locality 


In order for our approach to work, there must be suffi- 
cient locality present in real backup streams. In particu- 
lar, we need locality at the scale of our segment size so 
that most of the deduplication possible for a given seg- 
ment can be obtained by deduplicating it against a small 
number of prior segments. The existence of such local- 
ity is a necessary, but not sufficient condition: the exis- 
tence of such segments does not automatically imply that 
sparse indexing or any other method can efficiently find 
them. 

Whether or not such locality exists is an empirical 
question, which we find to be overwhelmingly answered 
in the affirmative. Figures 2 and 3 show a conservative 
estimate of this locality for our data sets for a variety 
of segment sizes. Here, we show how well segment- 
based deduplication could work given near-perfect seg- 
ment choice when each segment of the given data set 
can only be deduplicated against a small number M of
prior segments. We measure deduplication effectiveness 
by the percentage of duplicate chunks that deduplication 
fails to remove; the smaller this number, the better the 
deduplication. 

Figure 3: Conservative estimate (GREEDY) of deduplication
effectiveness obtainable by deduplicating each segment
against up to M prior segments for data set SMB. Shown are
results for 5 different segment sizes, with all segments
chosen via variable-size segmentation.

Because computing the optimal segments to deduplicate
against is infeasible in practice, we instead estimate
the deduplication effectiveness possible by using a simple
greedy algorithm (GREEDY) that chooses the segments to
deduplicate a given segment S against one at a time, each
time choosing the segment that will produce the maximum
additional deduplication. While GREEDY does an excellent
job of choosing segments, it consumes too much RAM to ever
be practical.
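
A minimal sketch of the greedy selection follows. It assumes, for simplicity, that additional deduplication is measured by the count of distinct chunk hashes covered rather than by bytes, and that every prior segment's full hash set is available in memory (which is exactly why the method is impractical at scale). The function and parameter names are hypothetical.

```python
from typing import Dict, List, Set


def greedy_choose(
    segment: Set[bytes],
    prior: Dict[int, Set[bytes]],
    max_segments: int,
) -> List[int]:
    """Pick up to max_segments prior segments, one at a time, each time
    taking the one that covers the most not-yet-covered chunk hashes."""
    remaining = set(segment)
    chosen: List[int] = []
    for _ in range(max_segments):
        best_id, best_gain = None, 0
        for seg_id, hashes in prior.items():
            if seg_id in chosen:
                continue
            gain = len(remaining & hashes)
            if gain > best_gain:
                best_id, best_gain = seg_id, gain
        if best_id is None:
            break  # no prior segment adds any further deduplication
        chosen.append(best_id)
        remaining -= prior[best_id]
    return chosen
```

Ties are broken in favor of the segment encountered first; any tie-breaking rule would serve for estimation purposes.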

As you can see, there is a great deal of locality at 
these scales: deduplicating each input segment against 
only 2 prior segments can suffice to remove all but 1% of 
the duplicate chunks (0.1% requires only 3 more). Not
shown is the zero-segment case (M = 0), where 93-98%
of duplicate chunks remain; the rest are removed because
of duplication within segments (segments are automatically
deduplicated against themselves). Larger segment sizes
yield slightly less locality, presumably because larger
pieces of incrementals include data from more faraway places.

Likely sources of locality in backup streams include 
writing out entire large items even when only a small part 
has changed (e.g., Microsoft Outlook’s mostly append- 
only PST files, which are often hundreds of megabytes 
long), locality in the order items are scanned (e.g., al- 
ways scanning files in alphabetical order), and the ten- 
dency for changes to be clustered in small areas. 


4.4 Overall deduplication 


How much of this locality are we able to exploit using 
sampling? Figure 4 addresses this point by showing for 
data set Workgroup and 10 MB variable size segments 
how much of the possible deduplication efficiency we 
obtain. Even with a sampling rate as low as 1/128, we 
remove all but 1.4% of the duplicate data given a max- 
imum of 10 or more champions per segment (0.7% for 
1/64). 

Figures 5 and 6 show the overall deduplication pro- 
duced by applying our approach to the two data sets. 

Figure 4: Deduplication efficiency obtained by using
sparse indexing with 10 MB average-sized segments
for various maximum numbers of champions (M) and
sampling rates for data set Workgroup. Shown for
comparison are GREEDY’s results given the same data.
Variable-size segmentation was used.





As can be seen, the degree of deduplication achieved
falls off as the sampling rate decreases and as the segment
size decreases. The amount of deduplication remains
roughly constant as sampling rate is traded off against
segment size: halving the sampling rate and doubling the
mean segment size leaves deduplication roughly the same.
This can be seen most easily in Figure 7, which plots
overall deduplication versus the average number of hooks
per segment (equal to segment size / chunk size × sampling
rate). We believe this relationship reflects the fact that
deduplication quality using sparse indexing depends
foremost on the number of hooks per segment.
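
The trade-off can be checked with a small calculation. The helper below is a hypothetical illustration of the formula above, using binary units (1 MB = 2^20 bytes, 1 KB = 2^10 bytes) as an assumption.

```python
KB = 2**10
MB = 2**20


def mean_hooks_per_segment(
    segment_bytes: float, chunk_bytes: float, sampling_rate: float
) -> float:
    """Average hooks per segment = segment size / chunk size * sampling rate."""
    return segment_bytes / chunk_bytes * sampling_rate


# 10 MB segments, 4 KB chunks, 1/64 sampling: 2560 chunks, 40 hooks.
hooks_10mb = mean_hooks_per_segment(10 * MB, 4 * KB, 1 / 64)
# Doubling the segment size while halving the sampling rate gives the same count.
hooks_20mb = mean_hooks_per_segment(20 * MB, 4 * KB, 1 / 128)
```

Both configurations yield 40 hooks per segment on average, which is consistent with their producing roughly the same deduplication.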

Note that these figures show simulated deduplication, 
not real deduplication. In particular, they take into ac- 
count only the space required to keep the data of the 
non-deduplicated chunks. Including container padding, 
the space required to store manifests, and other over- 
head would reduce these numbers somewhat. Similar 
overhead exists in all backup systems that use chunk- 
based deduplication. On the other hand, these numbers 
do not include any form of local compression of chunk 
data. In practice, chunks would be compressed (e.g., by 
Ziv-Lempel [29]), either individually or in groups, be- 
fore storing to disk. Such compression usually adds an 
additional factor of 1.5-2.5.

Using variable instead of fixed-size segmentation improves
deduplication using sparse indexing, as can be seen from
Figure 8. This improvement is due to increased locality:
with fixed-size segmentation, there are more segments that
produce substantial deduplication that must be found in
order to achieve high levels of deduplication quality.
Because this introduces more opportunities for serious
mistakes (e.g., missing such a segment due to poor
sampling), sparse indexing does substantially worse with
fixed-size segmentation.

Figure 5: Deduplication produced using sparse indexing
with up to 10 champions (M=10) for various sampling
rates and segment sizes for data set Workgroup.
For each point, the deduplication factor (original
size/deduplicated size) is shown. Shown for comparison is
perfect 4 KB deduplication, wherein all duplicate chunks
are removed. Variable-size segmentation was used.

Figure 6: Deduplication produced using sparse indexing
with up to 10 champions (M=10) for various
sampling rates and segment sizes for data set SMB.
For each point, the deduplication factor (original
size/deduplicated size) is shown. Shown for comparison is
perfect 4 KB deduplication, wherein all duplicate chunks
are removed. Variable-size segmentation was used.

Figure 7: Deduplication produced using sparse indexing
with up to 10 champions (M=10) versus the average
number of hooks per segment for various sampling
rates and segment sizes for data set Workgroup.
For each segment size, sampling rates (from right to left)
of 1/32, 1/64, 1/128, 1/256, 1/512, 1/1024, 1/2048, and
1/4096 are shown. Variable-size segmentation was used.

with 10 MB average size segments for selected sampling
rates for data set Workgroup.


4.5 RAM usage and comparison with Zhu et al.


Since one of the main objectives of this paper is to ar- 
gue that our approach significantly reduces RAM usage 
for comparable deduplication and throughput to existing 
approaches, we briefly describe here the approach used 
by Zhu et al. [28], which we call the Bloom Filter with
Paged Full Index (BFPFI) approach. 

BFPFI uses a full disk-based index of every chunk 
hash. To avoid having to access the disk for every hash 
lookup, it employs a Bloom Filter and a cache of chunk 
container indexes. The Bloom Filter uses one byte of 
RAM per hash and contains the hash of every chunk in 
the store. If the Bloom filter does not indicate that an 
incoming chunk is already in the store, then there is no 
need to consult the chunk index. Otherwise, the cache is 
searched and only if it fails to contain the given chunk’s 
hash, is the on-disk full chunk index consulted. Each 
time the on-disk index must be consulted, the index of 
the chunk container that contains the given chunk (if any) 
is paged into memory. 
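
The lookup path described above can be sketched as follows. This is our own simplified illustration, not the implementation of Zhu et al.: the Bloom filter parameters are arbitrary, Python sets and dictionaries stand in for the cache, the on-disk index, and the container indexes, and cache eviction is omitted.

```python
import hashlib
from typing import Dict, Iterator, Set


class BloomFilter:
    """Small Bloom filter; the sizes here are illustrative only."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes) -> Iterator[int]:
        # Derive several bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def lookup(
    chunk_hash: bytes,
    bloom: BloomFilter,
    cache: Set[bytes],
    disk_index: Dict[bytes, int],
    containers: Dict[int, Set[bytes]],
) -> str:
    """Return which tier answered: 'bloom', 'cache', 'disk', or 'disk-miss'."""
    if not bloom.might_contain(chunk_hash):
        return "bloom"  # definitely a new chunk; no index access needed
    if chunk_hash in cache:
        return "cache"
    container_id = disk_index.get(chunk_hash)  # the expensive disk access
    if container_id is None:
        return "disk-miss"  # Bloom filter false positive
    cache.update(containers[container_id])  # page in that container's index
    return "disk"
```

After one disk lookup pages in a container's index, subsequent chunks from the same container are answered from the cache, which is how chunk locality raises the hit rate.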

The hit rate of the BFPFI cache (and hence the overall 
throughput) depends on the degree of chunk locality of 
the input data: because chunk containers contain chunks 
that occurred together before, high chunk locality im- 
plies a high hit rate. The only parameter that impacts the 
deduplication factor in BFPFI is the average chunk size, 
since it finds all the duplicate chunks. Smaller chunk 
sizes increase the deduplication factor at the cost of re- 
quiring more RAM for the Bloom filter. 

Both approaches degrade under conditions of poor 
chunk locality: with BFPFI, throughput degrades, 
whereas with sparse indexing, deduplication quality de- 
grades. Unlike with BFPFI, with sparse indexing it is 
possible to guarantee a minimum throughput by impos- 
ing a maximum number of champions, which can be im- 
portant given today’s restricted backup window times. It 
is, of course, impossible to guarantee a minimum dedu- 
plication factor because the maximum deduplication pos- 
sible is limited by characteristics of the input data that are 
beyond the control of any store. 

The amount of RAM required by one of our sparse
indexes or the Bloom filter of the BFPFI approach is
linearly proportional to the maximum possible number
of unique chunks in that store. Accordingly, we plot
RAM usage as the ratio of RAM required per amount
of physical disk storage. Figure 9 shows the estimated
RAM usage of sparse indexes with 5, 10, and 20 MB
variable-sized segments as well as the Bloom filter used
by BFPFI. We assume here a local compression factor of
2, which allows 100 TB of disk to store twice as many
chunks as would otherwise be possible. Because it is
easy to achieve good RAM usage if deduplication quality
can be neglected, we also show the deduplication factor
for data set Workgroup for each case.

Figure 9: RAM space required per 100 TB of disk for
sparse indexing with up to 10 champions (M=10) and
for a Bloom filter. For each point, the deduplication
factor for data set Workgroup is shown. Each sparse
indexing series shows points for sampling rates of (right to
left) 1/32, 1/64, 1/128, 1/256, and 1/512 while the Bloom
filter series shows points for chunk sizes of 4, 8, 16, and
32 KB. Variable-size segmentation was used.

You will note that our approach uses substantially less 
RAM than BFPFI for the same quality of deduplication. 
For example, for a store with 100 TB of disk, a sparse 
index with 10 MB segments and 1/64 sampling requires 
17 GB whereas we estimate a Bloom filter would require 
36 GB for an equivalent level of deduplication. Alter- 
natively, starting with a Bloom filter using 8 KB chunks 
(the value used by Zhu et al. [28]), which requires 25 GB 
for 100 TB, we estimate we can get the same deduplica- 
tion quality (using 4 KB chunks) but use only 10 GB 
(10 MB segments) or 6 GB (20 MB segments) of RAM. 
For comparison, the Jumbo Store [14], which keeps a 
full chunk index in RAM, would need 1,500 GB for the 
second case. 

A sparse index has one key for each unique hook en- 
countered; when using a sampling rate of 1/s, on average 
1/s of unique chunks will have a hash which qualifies 
as a hook. Because of the random nature of hashes and 
the law of large numbers (we are dealing with billions 
of unique chunks), we can treat this average as a maximum
for estimation purposes. To conserve RAM needed
by our simulated sparse indexes, we generally limit the
number of manifest IDs per hook in our sparse index- 
ing experiments to 1; that is, for each hook, we simulate 
keeping the ID of only the last manifest containing that 
hook. This slightly decreases deduplication quality (see 
Section 4.7), but saves a lot of RAM. 

Such a sparse index needs to be big enough to hold
u/s keys, each of which has exactly 1 entry, where u
is the maximum number of unique chunks possible. The
sampling rate is thus the primary factor controlling RAM
usage for our experiments. The actual space is ku/s,
where k is a constant depending on the exact data
structure implementation used. For this figure, we have used
k = 21.7 bytes based on using a chained hash table with
a maximum 70% load factor, 4-byte internal pointers,
8-byte manifest IDs, and 4 key check bytes per entry. Using
only a few bytes of the key saves substantial RAM
but means the index can, very rarely, make mistakes;
this may occasionally result in the wrong champion being
selected, but is unlikely to substantially alter the overall
deduplication quality. We calculate the size of the
BFPFI Bloom filter per Zhu et al. as 1 byte per unique
chunk [28].
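
These estimates can be reproduced with a short calculation. The sketch below rests on two assumptions that are ours for illustration: binary units (1 TB = 2^40 bytes, 1 GB = 2^30 bytes), and a decomposition of k into a 16-byte chained entry (pointer, manifest ID, key check bytes) plus one 4-byte bucket pointer per 0.7 stored keys, which happens to give 21.7 bytes. All function names are hypothetical.

```python
KB = 2**10
GB = 2**30
TB = 2**40


def bytes_per_key(
    load_factor: float = 0.70,
    pointer_bytes: int = 4,
    manifest_id_bytes: int = 8,
    key_check_bytes: int = 4,
) -> float:
    """One decomposition of k: a chained entry plus its share of the bucket array."""
    entry = pointer_bytes + manifest_id_bytes + key_check_bytes
    bucket_share = pointer_bytes / load_factor
    return entry + bucket_share


def max_unique_chunks(
    disk_bytes: float, chunk_bytes: float, compression: float = 2.0
) -> float:
    """u: the most unique chunks the disk can hold after local compression."""
    return disk_bytes * compression / chunk_bytes


def sparse_index_ram(
    disk_bytes: float, chunk_bytes: float, sampling_rate: float, k: float = 21.7
) -> float:
    """RAM in bytes for a sparse index: k * u / s, with sampling_rate = 1/s."""
    return k * max_unique_chunks(disk_bytes, chunk_bytes) * sampling_rate


def bloom_filter_ram(disk_bytes: float, chunk_bytes: float) -> float:
    """RAM in bytes for the BFPFI Bloom filter at 1 byte per unique chunk."""
    return 1.0 * max_unique_chunks(disk_bytes, chunk_bytes)


sparse_gb = sparse_index_ram(100 * TB, 4 * KB, 1 / 64) / GB
bloom_gb = bloom_filter_ram(100 * TB, 8 * KB) / GB
```

Under these assumptions, a sparse index with 4 KB chunks and 1/64 sampling needs about 17 GB per 100 TB of disk, and a Bloom filter with 8 KB chunks needs 25 GB.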

Additional RAM is needed for both approaches for
per-stream buffers. In our case, the per-stream space is
proportional to the segment size.


4.6 Throughput 


Because we do not move around or even simulate mov- 
ing around chunk data, we cannot estimate overall read 
or write throughput. However, because we do collect 
statistics on how many champions are loaded per seg- 
ment, we can estimate the I/O burden that loading cham- 
pions places on a system using our approach. Aside from 
this I/O and writing out new manifests, the only other 
I/O our system needs to do when ingesting data is that 
required by any other deduplicating store: reading in the 
input data and writing out the non-deduplicated chunks. 

Similarly, the majority of the computation required by 
our approach—chunking, hashing, and compression— 
also must be done by any chunk-based deduplication en- 
gine. For an alternative way of getting a handle on the 
throughput our approach can support, see Section 4.8 
where we briefly describe some early throughput mea- 
surements of a product embodying our approach. 

Figure 10 shows the average number of champion
manifests actually loaded per segment for the data set
Workgroup with up to 10 champions per segment allowed.
The equivalent chart for SMB (not shown) is similar, but
scaled down by a factor of 2/3. You will notice that the
average number loaded is substantially less than the
maximum allowed, 10. This is primarily because most
segments in these data sets can be completely deduplicated
using only a few champions (GREEDY does not load
substantially more champions than 1/32-sampling). Lower
sampling rates result in even fewer champions being loaded
because sparser indexes result in fewer candidate
champions being identified.

Figure 10: Average number of champions actually
loaded per segment using sparse indexing with up to
10 champions (M=10) for various sampling rates and
segment sizes for data set Workgroup. Variable-size
segmentation was used. [Plot: x-axis, sampling rate from
1/32 to 1/4096; one series each for 1, 2.5, 5, 10, and
20 MB mean segment size.]


Loading a champion manifest requires a random seek 
followed by a quite small amount of sequential reading 
(manifests are a hundredth of the size of segments and 
measured in KBs). Accordingly, the I/O burden due to 
loading champions is best measured in terms of the av- 
erage number of seeks (equivalently champions loaded) 
per unit of input data. Figure 11 shows this informa- 
tion for the Workgroup data set. Note that the ordering 
of segment sizes has reversed: although bigger segments 
load more champions each, they load their champions 
so much less frequently that their overall rate of loading 
champions per megabyte of input data, and hence their
I/O burden, is less.


If we conservatively assume that loading a champion
manifest takes 20 ms and that we load 0.2 champions
per megabyte on average, then a single drive doing nothing
else could support a rate of 1/(0.2 × 20 ms/MB) =
250 MB/s. Of course, as we mentioned above, there is
other I/O that needs to be done as well. However, in
practice deduplication systems are usually deployed with 10
or more drives, so the real number before other I/O needs
is more like 2.5 GB/s. This is a sufficiently light burden
that we expect that some other component of the system
will be the throughput bottleneck.
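
The arithmetic behind this estimate is shown below as a small helper. The function name is hypothetical, and the model assumes drives are used in parallel with perfect scaling and do nothing but load manifests.

```python
def manifest_load_limited_throughput(
    champions_per_mb: float, seek_seconds: float, drives: int = 1
) -> float:
    """Upper bound on ingest rate in MB/s if loading champion manifests
    were the only disk activity."""
    seconds_per_mb = champions_per_mb * seek_seconds  # seek time per MB of input
    return drives / seconds_per_mb


single_drive = manifest_load_limited_throughput(0.2, 0.020)        # about 250 MB/s
ten_drives = manifest_load_limited_throughput(0.2, 0.020, drives=10)  # about 2500 MB/s
```

With 0.2 champions per megabyte and 20 ms per load, each megabyte of input costs 4 ms of seeking, hence roughly 250 MB/s per drive.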

Figure 11: Average number of champions actually
loaded per 1 MB of input data using sparse indexing
with up to 10 champions (M=10) for various sampling
rates and segment sizes for data set Workgroup.
Variable-size segmentation was used.





4.7 Optimization 


Figure 12 shows how capping the number of manifest 
IDs kept per hook in the sparse index affects deduplica- 
tion quality. Keeping only one manifest ID per hook does 
reduce deduplication somewhat (99.29% versus 99.44% 
duplicate chunks removed for M = 10 here), but greatly
reduces the amount of RAM required for the sparse in- 
dex. 

None of the experiments we have reported on so far use
any manifest caching. While the design of our simulator
unfortunately makes it hard to implement manifest
caching correctly (we compute the ith champion for each
segment in parallel), we were able to conservatively
approximate a cache just large enough to hold the champions
from the previous segment (see Note 3). We find that even
with this suboptimal implementation, manifest caching
reduces champions loaded per segment and slightly improves
deduplication quality. For 10 MB variable-size
segments, M = 10, and 1 in 64 sampling for data set
Workgroup, for example, our version of manifest caching
lowers the average number of champions loaded per segment
by 3.9% and improves the deduplication factor by
1.1%.


4.8 Productization 


Our approach is being used to build a family of VTL
products that use deduplication internally to increase the
amount of data they can support. Already on the market
are the HP D2D2500 and the D2D4000. Most of the
work described in this paper, however, was done before
even a prototype of these products was available.

Figure 12: Deduplication efficiency obtained as the
maximum number of manifest IDs kept per hook
varies for 10 MB average size segments and a sampling
rate of 1/64 for data set Workgroup. Variable-size
segmentation was used. [Plot: x-axis, maximum # of segments
compacted against (M), 1 to 15; y-axis, % of duplicate
chunks not removed, 0% to 6%; one series each for 1, 2, 3,
4, 5, and 10 manifests/hook and for no limit.]

A third-party testing firm, Binary Testing Ltd., was
hired by HP to test the D2D4000’s deduplication
performance [5]. The D2D4000 configuration they tested has
six 750 GB disk drives running RAID 6, 8 GB RAM, two
AMD Opteron 3 GHz dual-core processors, and a 4 Gb
Fibre Channel link. We report a few representative
excerpts from their report here: changing 0.4% of every file
of a 4 GB file server data set every day and taking fulls
for three months produced a deduplication factor of 69.2;
the same change schedule applied to a 4 GB Exchange
server produced a factor of 24.9. Changing instead only
20% of the items every day, but each by 5%, yielded factors
of 25.5 (for the file server) and 40.3 (for the Exchange
server). Note that these numbers include all overhead
and local compression.

Preliminary throughput testing of a similar system
with twelve 750 GB disk drives shows write rates of 90 MB/s
(1 stream) and 120 MB/s (4 streams) and read rates of
40-50 MB/s (1 stream) and 25-35 MB/s (4 streams). The
restore path was still being optimized when these
measurements were taken, so those numbers may improve
substantially. We believe these product results validate our
approach, and demonstrate that we have not overlooked
any crucial points.


5 Related Work 


Chunking has been used to weed out near duplicates in 
repositories [16], conserve network bandwidth [20], and 
reduce storage space requirements [1, 22, 23, 27]. It 
has also been used to synchronize large data sets reliably
while conserving network bandwidth [14, 17].

Archival and backup storage systems detect dupli- 
cate data at granularities that range from an entire file, 
as in EMC’s Centera [12], down to individual fixed- 
size disk blocks, as in Venti [22], and variable-size data 
chunks, as in the Low-Bandwidth Network File Sys- 
tem [20]. Variable-sized chunking has also been used 
in the commercial sector, for example, by Data Domain 
and Riverbed Technology. Deep Store [27] is a large- 
scale archival storage system that uses both delta com- 
pression [2, 11] and chunking to reduce storage space 
requirements. How much deduplication is obtained de- 
pends on the inherent content overlaps in the data, the 
granularity of chunks, and the chunking method [21]. 
Deduplication using chunking can be quite effective for 
data that evolves slowly (mainly) through small changes, 
additions, and deletions [26]. 

Chunking is just one of the methods in the literature 
used to detect similarities or content overlap between 
documents. Shingling [8] was developed by Broder for 
near duplicate detection in web pages. Manber [18], 
Brin et al. [7], and Forman et al. [16] have also developed
techniques for finding similarities between documents in
large repositories.

Various approaches have been used to reduce disk ac- 
cesses when querying an index. Database buffer man- 
agement strategies [10] that aim to efficiently maintain a 
‘working set’ of rows of the index in a buffer cache have 
been well researched. However, these strategies do not 
work in the case of chunk-based deduplication because 
chunk IDs are random hashes for which it is not possible 
to identify or maintain a working set. 

Bloom filters [6] have also been used to minimize in- 
dex accesses. A Bloom filter, which can give false posi- 
tives but not false negatives, can be used to determine the 
existence of a key in an index before actually querying 
the index. If the Bloom filter does not contain the key, 
then the index does not need to be queried, thereby eliminating
both an index lookup and possibly a disk access. Bloom
filters have been used by large scale distributed storage 
systems such as Google’s BigTable [9] and by Data Do- 
main [28]. 

Besides using Bloom filters to improve the dedupli- 
cation throughput, Data Domain exploits chunk local- 
ity for index caching as well as for laying out chunks 
on disk. By using these techniques Data Domain can 
avoid a large number of disk accesses related to index 
queries. Venti uses a disk-based hash table divided into 
buckets where a hash function is used to map chunk 
hashes to appropriate buckets. To improve the index 
lookup performance, Venti uses caching, striping, and 
write buffering. Foundation [23] is an archival storage 
system that preserves users’ data and dependencies by 
capturing and storing regular snapshots of every user’s
virtual machine. Chunking is used to deduplicate the
snapshots. Foundation also uses a combination of Bloom 
filters and locality-friendly on-disk layouts to improve 
the performance of index lookups. 


6 Conclusions 


D2D backup is increasingly becoming the backup so- 
lution of choice, and deduplication is an essential fea- 
ture of D2D backup. Our experimental evaluation has 
shown that there exists a lot of locality within backup 
data at the scale of a small number of megabytes. Our
approach exploits this locality to solve the chunk-lookup
disk bottleneck problem. Through content-based seg- 
mentation, sampling, and sparse indexing, we divide in- 
coming streams into segments, identify similar existing 
segments, and deduplicate against them, yielding excel- 
lent deduplication and throughput while requiring little 
RAM. 

While our approach allows a few duplicate chunks to 
be stored, we more than make up for this loss of dedu- 
plication by using a smaller chunk size (possible because 
of the small RAM requirements), which produces greater 
deduplication. Compared with the BFPFI approach, we 
use less than half the RAM for an equivalent high level 
of deduplication. The practicality of our approach has
been demonstrated by its use as the basis of an HP
product family.


Notes


1 E.g., 30 fulls with 5% data change/day might, when deduplicated,
occupy the space of 1 + 29 × 5% = 2.45 fulls, so the extra full's worth of
space needed by out-of-line amounts to requiring 29% more disk space.

2 Standard TTTD parameters yield for this chunk size a minimum 
chunk size of 1,856 and a maximum chunk size of 11,299 using primary 
divisor 2,179 and secondary divisor 1,099. 

3 We get the effect of a size- manifest cache by causing our sim- 
ulator to do the following: each time it chooses the ith champion for 
a given segment, it immediately also deduplicates the given segment 
against the ith champion of the immediately preceding segment as well.


Abstract


The performance of file systems and related software de- 
pends on characteristics of the underlying file-system im- 
age (i.e., file-system metadata and file contents). Un- 
fortunately, rather than benchmarking with realistic file- 
system images, most system designers and evaluators 
rely on ad hoc assumptions and (often inaccurate) rules 
of thumb. Furthermore, the lack of standardization and 
reproducibility makes file system benchmarking ineffec- 
tive. To remedy these problems, we develop Impressions, 
a framework to generate statistically accurate file-system 
images with realistic metadata and content. Impressions 
is flexible, supporting user-specified constraints on vari- 
ous file-system parameters using a number of statistical 
techniques to generate consistent images. In this paper 
we present the design, implementation and evaluation 
of Impressions, and demonstrate its utility using desktop 
search as a case study. We believe Impressions will prove 
to be useful for system developers and users alike. 


1 Introduction 


File system benchmarking is in a state of disarray. In 
spite of tremendous advances in file system design, the 
approaches for benchmarking still lag far behind. The 
goal of benchmarking is to understand how the sys- 
tem under evaluation will perform under real-world con- 
ditions and how it compares to other systems; how- 
ever, recreating real-world conditions for the purposes of 
benchmarking file systems has proven challenging. The 
two main challenges in achieving this goal are generat- 
ing representative workloads, and creating realistic file- 
system state. 

While creating representative workloads is not an en- 
tirely solved problem, significant steps have been taken 
towards this goal. Empirical studies of file-system access 
patterns [4, 19, 33] and file-system activity traces [38, 
45] have led to work on synthetic workload genera- 
tors [2, 14] and methods for trace replay [3, 26]. 

The second, and perhaps more difficult, challenge is to
recreate the file-system state such that it is representative
of the target usage scenario. Several factors contribute
to file-system state; important among them are the in-memory
state (contents of the buffer cache), the on-disk
state (disk layout and fragmentation), and the characteristics
of the file-system image (files and directories belonging
to the namespace and file contents).

One well understood contributor to state is the in- 
memory state of the file system. Previous work has 
shown that the contents of the cache can have signifi- 
cant impact on the performance results [11]. Therefore, 
system initialization during benchmarking typically con- 
sists of a cache “warm-up” phase wherein the workload 
is run for some time prior to the actual measurement 
phase. Another important factor is the on-disk state of 
the file system, or the degree of fragmentation; it is a 
measure of how the disk blocks belonging to the file sys- 
tem are laid out on disk. Previous work has shown that 
fragmentation can adversely affect performance of a file 
system [44]. Thus, prior to benchmarking, a file system 
should undergo aging by replaying a workload similar to 
that experienced by a real file system over a period of 
time [44]. 

Surprisingly, one key contributor to file-system state 
has been largely ignored — the characteristics of the file- 
system image. The properties of file-system metadata 
and the actual content within the files are key contributors
to file-system state, and can have a significant impact
on the performance of a system. Properties of file-system
metadata include information on how directories
are organized in the file-system namespace, how files are 
organized into directories, and the distributions for vari- 
ous file attributes such as size, depth, and extension type. 
Consider a simple example: the time taken for a find 
operation to traverse a file system while searching for a 
file name depends on a number of attributes of the file- 
system image, including the depth of the file-system tree 
and the total number of files. Similarly, the time taken 
for a grep operation to search for a keyword also de- 
pends on the type of files (i.e., binary vs. others) and the 
file content. 

File-system benchmarking frequently requires this
sort of information on file systems, much of which is
available in the form of empirical studies of file-system
contents [1, 12, 21, 29, 41, 42]. These studies focus on
measuring and modeling different aspects of file-system
metadata by collecting snapshots of file-system images
from real machines. The studies range from a few machines
to tens of thousands of machines across different
operating systems and usage environments. Collecting
and analyzing this data provides useful information on
how file systems are used in real operating conditions.


Paper | Description | Used to measure
HAC [17] | File system with 17000 files totaling 150 MB |
IRON [36] | None provided | Checksum and metadata replication overhead;
LBFS [30] | 10702 files from /usr/local, total size 354 MB |
LISFS [34] | 633 MP3 files, 860 program files, 11502 man pages | Disk space overhead; performance of search-like
PAST [40] | 2 million files, mean size 86 KB, median 4 KB, largest file size 2.7 GB, smallest 0 Bytes, total size 166.6 GB | File insertion, global storage utilization in a P2P
Pastiche [9] | File system with 1641 files, 109 dirs, 13.4 MB total size | Performance of backup and restore utilities
Pergamum [47] | Randomly generated files of “several” megabytes |
Samsara [10] | File system with 1676 files and 13 MB total size | Data transfer and querying performance, load dur-
Segank [46] | 5-deep directory tree, 5 subdirs and 10 8 KB files per directory | Performance of Segank: volume update, creation
SFS read-only [15] | 1000 files distributed evenly across 10 directories and contain random data |
TFS [7] | Files taken from /usr to get “realistic” mix of file sizes | Performance with varying contribution of space from local file systems
WAFL backup [20] | 188 GB and 129 GB volumes taken from the Engineering department | Performance of physical and logical backup, and recovery strategies
yFS [49] | Avg. file size 16 KB, avg. number of files per directory 64, random file names | Performance under various benchmarks (file creation, deletion)


Table 1: Choice of file system parameters in prior research.


In spite of the wealth of information available in file- 
system studies, system designers and evaluators continue 
to rely on ad hoc assumptions and often inaccurate rules 
of thumb. Table 1 presents evidence to confirm this hy- 
pothesis; it contains a (partial) list of publications from 
top-tier systems conferences in the last ten years that re- 
quired a test file-system image for evaluation. We present 
both the description of the file-system image provided in 
the paper and the intended goal of the evaluation. 


In the table, there are several examples where a new 
file system or application design is evaluated on the eval- 
uator’s personal file system without describing its prop- 
erties in sufficient detail for it to be reproduced [7, 20, 
36]. In others, the description is limited to coarse-grained 
measures such as the total file-system size and the num- 
ber of files, even though other file-system attributes (e.g., 
tree depth) are relevant to measuring performance or 
storage space overheads [9, 10, 17, 30]. File systems are 
also sometimes generated with parameters chosen randomly
[47, 49], or chosen without explanation of the significance
of the values [15, 34, 46]. Occasionally, the
parameters are specified in greater detail [40], but not
enough to recreate the original file system.


The important lesson to be learnt here is that there 
is no standard technique to systematically include infor- 
mation on file-system images for experimentation. For 
this reason, we find that more often than not, the choices 
made are arbitrary, suited for ease-of-use more than ac- 
curacy and completeness. Furthermore, the lack of stan- 
dardization and reproducibility of these choices makes it 
near-impossible to compare results with other systems. 


To address these problems and improve one important 
aspect of file system benchmarking, we develop Impres- 
sions, a framework to generate representative and statis- 
tically accurate file-system images. Impressions gives 
the user flexibility to specify one or more parameters 
from a detailed list of file system parameters (file-system 
size, number of files, distribution of file sizes, etc.). Im- 
pressions incorporates statistical techniques (automatic 
curve-fitting, resolving multiple constraints, interpola- 
tion and extrapolation, etc.) and uses statistical tests for 
goodness-of-fit to ensure the accuracy of the image. 


We believe Impressions will be of great use to sys- 
tem designers, evaluators, and users alike. A casual user 
looking to create a representative file-system image with- 
out worrying about carefully selecting parameters can 
simply run Impressions with its default settings; Impres- 
sions will use pre-specified distributions from file-system 
studies to create a representative image. A more sophisti- 
cated user has the power to individually control the knobs 
for a comprehensive set of file-system parameters; Impressions
will carefully work out the statistical details to
produce a consistent and accurate image. In both cases,
Impressions ensures complete reproducibility of the image
by reporting the distributions used, parameter values,
and seeds for random number generators.


Figure 1: Impact of directory tree structure. Shows
impact of tree depth on time taken by find. The file systems are created
by Impressions using default distributions (Table 2). To exclude effects
of the on-disk layout, we ensure a perfect disk layout (layout score 1.0)
for all cases except the one with fragmentation (layout score 0.95).
The flat tree contains all 100 directories at depth 1; the deep tree has
directories successively nested to create a tree of depth 100.

In this paper we present the design, implementation 
and evaluation of the Impressions framework (§3), which 
we intend to release for public use in the near future. Im- 
pressions is built with the following design goals: 


• Accuracy: in generating various statistical constructs
to ensure a high degree of statistical rigor.


• Flexibility: in allowing users to specify a number of
file-system distributions and constraints on parameter
values, or in choosing default values.


• Representativeness: by incorporating known distributions
from file-system studies.

• Ease of use: by providing a simple, yet powerful,
command-line interface.


Using desktop search as a case study, we demonstrate the 
usefulness and ease of use of Impressions in quantifying 
application performance, and in finding application policies
and bugs (§4). To bring the paper to a close, we
discuss related work (§5), and finally conclude (§6).


2 Extended Motivation 


We begin this section by asking a basic question: does 
file-system structure really matter? We then describe the 
goals for generating realistic file-system images and dis- 
cuss existing approaches to do so. 


2.1 Does File-System Structure Matter? 

Structure and organization of file-system metadata mat- 
ters for workload performance. Let us take a look at 
the simple example of a frequently used UNIX utility:
find. Figure 1 shows the relative time taken to run
“find /” searching for a file name on a test file sys- 
tem as we vary some parameters of file-system state. 
The first bar represents the time taken for the run on 
the original test file system. Subsequent bars are normal- 
ized to this time and show performance for a run with the 
file-system contents in buffer cache, a fragmented ver- 
sion of the same file system, a file system created by flat- 
tening the original directory tree, and finally one by deep- 
ening the original directory tree. The graph echoes our 
understanding of caching and fragmentation, and brings 
out one aspect that is often overlooked: structure really 
matters. From this graph we can see that even for a sim- 
ple workload, the impact of tree depth on performance 
can be as large as that with fragmentation, and varying 
tree depths can have significant performance variations 
(300% between the flat and deep trees in this example). 
Assumptions about file-system structure have often 
trickled into file system design, but no means exist to 
incorporate the effects of realistic file-system images in 
a systematic fashion. As a community, we well under- 
stand that caching matters, and have begun to pay atten- 
tion to fragmentation, but when it comes to file-system 
structure, our approach is surprisingly laissez faire. 
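
The flat and deep configurations described for Figure 1 can be approximated with a short script. The following Python sketch is illustrative only and is not part of Impressions: the helper names (build_tree, max_depth, time_find) are hypothetical, and an in-process walk over a freshly created tree will mostly be served from the buffer cache, so it should not be expected to reproduce the measurements in the figure.

```python
import os
import tempfile
import time


def build_tree(root, num_dirs, deep):
    """Create num_dirs directories under root, nested (deep) or as siblings (flat)."""
    current = root
    for i in range(num_dirs):
        if deep:
            # Nest each new directory inside the previous one.
            current = os.path.join(current, "d")
            os.mkdir(current)
        else:
            # Place every directory directly under the root (depth 1).
            os.mkdir(os.path.join(root, "d%d" % i))


def max_depth(root):
    """Return the depth of the deepest directory below root (root is depth 0)."""
    deepest = 0
    for dirpath, _dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        deepest = max(deepest, depth)
    return deepest


def time_find(root, name):
    """Walk the whole tree looking for a file name, roughly like `find root -name name`."""
    start = time.perf_counter()
    hits = [os.path.join(d, f)
            for d, _dirs, files in os.walk(root)
            for f in files if f == name]
    return time.perf_counter() - start, hits


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as flat, tempfile.TemporaryDirectory() as deep:
        build_tree(flat, 100, deep=False)
        build_tree(deep, 100, deep=True)
        print("flat:", time_find(flat, "needle")[0])
        print("deep:", time_find(deep, "needle")[0])
```

A faithful comparison would also need to control the cache and on-disk layout, as the figure caption notes; this sketch only varies the shape of the tree.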


2.2 Goals for Generating FS Images 
We believe that the file-system image used for an evalua- 
tion should be realistic with respect to the workload; the 
image should contain a sufficient degree of detail to real- 
istically exercise the workload under consideration. An 
increasing degree of detail will likely require more effort 
and slow down the process. Thus it is useful to know 
the degree sufficient for a given evaluation. For exam- 
ple, if the performance of an application simply depends 
on the size of files in the file system, the chosen file- 
system image should reflect that. On the other hand, if 
the performance is also sensitive to the fraction of binary 
files amongst all files (e.g., to evaluate desktop search in- 
dexing), then the file-system image also needs to contain 
realistic distributions of file extensions. 

We walk through some examples that illustrate the dif- 
ferent degrees of detail needed in file-system images. 


• At one extreme, a system could be completely
oblivious to both metadata and content. An example
of such a system is a mirroring scheme (RAID-1 [35])
underneath a file system, or a backup utility
taking whole-disk backups. The performance of
such schemes depends solely on the block traffic.


Alternately, systems could depend on the attributes of the 
file-system image with different degrees of detail: 


• The performance of a system can depend on the
amount of file data (number of files and directories,
or the size of files and directories, or both) in any
given file system (e.g., a backup utility taking whole
file-system snapshots).


• Systems can depend on the structure of the file system
namespace and how files are organized in it
(e.g., a version control system for a source-code
repository).


• Finally, many systems also depend on the actual
data stored within the files (e.g., a desktop search
engine for a file system, or a spell-checker).


Impressions is designed with this goal of flexibility 
from the outset. The user is given complete control 
of a number of file-system parameters, and is provided 
with an easy-to-use interface. Transparently, Impressions
ensures accuracy and representativeness.


2.3 Existing Approaches 


One alternate approach to generating realistic file-system 
images is to randomly select a set of actual images from 
a corpus, an approach popular in other fields of computer 
science such as Information Retrieval, Machine Learning 
and Natural Language Processing [32]. In the case of file 
systems the corpus would consist of a set of known file- 
system images (e.g., tarballs). This approach arguably 
has several limitations which make it difficult and un- 
suitable for file systems research. First, there are too 
many parameters required to accurately describe a file- 
system image that need to be captured in a corpus. Sec- 
ond, without precise control in varying these parameters 
according to experimental needs, the evaluation can be 
blind to the actual performance dependencies. Finally, 
the cost of maintaining and sharing any realistic corpus 
of file-system images would be prohibitive. The size of 
the corpus itself would severely restrict its usefulness es- 
pecially as file systems continue to grow larger. 

Unfortunately, these limitations have not deterred re- 
searchers from using their personal file systems as a (triv- 
ial) substitute for a file-system corpus. 


3 The Impressions Framework 


In this section we describe the design, implementation 
and evaluation of Impressions: a framework for gener- 
ating file-system images with realistic and statistically 
accurate metadata and content. Impressions is flexible
enough to create file-system images with varying config- 
urations, guaranteeing the accuracy of images by incor- 
porating a number of statistical tests and techniques. 

We first present a summary of the different modes of 
operation of Impressions, and then describe the individual
statistical constructs in greater detail. Wherever applicable,
we evaluate their accuracy and performance.


Parameter | Default Model & Parameters
Directory count w/ depth | Generative model
Directory size (subdirs) | Generative model
File size by count | Lognormal-body (α1=0.99994, μ=9.48, σ=2.46); Pareto-tail (k=0.91, Xm=512MB)
File size by containing bytes | Mixture-of-lognormals (α1=0.76, μ1=14.83, σ1=2.35; α2=0.24, μ2=20.93, σ2=1.48)
Extension popularity | Percentile values
File count w/ depth | Poisson (λ=6.49)
Bytes with depth | Mean file size values
Directory size (files) | Inverse-polynomial (degree=2, offset=2.36)
File count w/ depth (w/ special directories) | Conditional probabilities (biases for special dirs)
Degree of Fragmentation | Layout score (1.0) or Pre-specified workload


Table 2: Parameters and default values in Impressions.
List of distributions and their parameter values used in the
Default mode.


3.1 Modes of Operation 
A system evaluator can use Impressions in different 
modes of operation, with varying degree of user input. 

Sometimes, an evaluator just wants to create a repre- 
sentative file-system image without worrying about the 
need to carefully select parameters. Hence, in the auto- 
mated mode, Impressions is capable of generating a file- 
system image with minimal input required from the user 
(e.g., the size of the desired file-system image), relying 
on default settings of known empirical distributions to 
generate representative file-system images. We refer to 
these distributions as original distributions. 

At other times, users want more control over the im- 
ages, for example, to analyze the sensitivity of perfor- 
mance to a given file-system parameter, or to describe a 
completely different file-system usage scenario. Hence, 
Impressions supports a user-specified mode, where a 
more sophisticated user has the power to individually 
control the knobs for a comprehensive set of file-system 
parameters; we refer to these as user-specified distribu- 
tions. Impressions carefully works out the statistical de- 
tails to produce a consistent and accurate image. 

In both cases, Impressions ensures complete reproducibility
of the file-system image by reporting the distributions
used, their parameter values, and seeds for random
number generators.

Impressions can use any dataset or set of parameter- 
ized curves for the original distributions, leveraging a 
large body of research on analyzing file-system proper- 
ties [1, 12, 21, 29, 41, 42]. For illustration, in this pa- 
per we use a recent static file-system snapshot dataset 
made publicly available [1]. The snapshots of file-system 
metadata were collected over a five-year period representing
over 60,000 Windows PC file systems in a large
corporation. These snapshots were used to study distributions
and temporal changes in file size, file age,
file-type frequency, directory size, namespace structure, 
file-system population, storage capacity, and degree of 
file modification. The study also proposed a generative 
model explaining the creation of file-system namespaces. 

Impressions provides a comprehensive set of individ- 
ually controllable file system parameters. Table 2 lists 
these parameters along with their default selections. For 
example, a user may specify the size of the file-system 
image, the number of files in the file system, and the dis- 
tribution of file sizes, while selecting default settings for 
all other distributions. In this case, Impressions will en- 
sure that the resulting file-system image adheres to the 
default distributions while maintaining the user-specified 
invariants. 


3.2 Basic Techniques 

The goal of Impressions is to generate realistic file- 
system images, giving the user complete flexibility and 
control to decide the extent of accuracy and detail. To 
achieve this, Impressions relies on a number of statistical 
techniques. 

In the simplest case, Impressions needs to create sta- 
tistically accurate file-system images with default distri- 
butions. Hence, a basic functionality required by Im- 
pressions is to convert the parameterized distributions 
into real sample values used to create an instance of a 
file-system image. Impressions uses random sampling to 
take a number of independent observations from the re- 
spective probability distributions. Wherever applicable, 
such parameterized distributions provide a highly com- 
pact and easy-to-reproduce representation of observed 
distributions. For cases where standard probability dis- 
tributions are infeasible, a Monte Carlo method is used. 

A user may want to use file system datasets other than 
the default choice. To enable this, Impressions provides 
automatic curve-fitting of empirical data. 

Impressions also provides the user with the flexibil- 
ity to specify distributions and constraints on parame- 
ter values. One challenge thus is to ensure that multi- 
ple constraints specified by the user are resolved con- 
sistently. This requires statistical techniques to ensure 
that the generated file-system images are accurate with 
respect to both the user-specified constraints and the de- 
fault distributions. 

In addition, the user may want to explore values of file
system parameters not captured in any dataset. For this
purpose, Impressions provides support for interpolation 
and extrapolation of new curves from existing datasets. 

Finally, to ensure the accuracy of the generated image,
Impressions contains a number of built-in statistical
tests, for goodness-of-fit (e.g., Kolmogorov-Smirnov,
Chi-Square, and Anderson-Darling), and to estimate
error (e.g., Confidence Intervals, MDCC, and Standard Error).
Where applicable, these tests ensure that all curve-fit
approximations and internal statistical transformations
adhere to the highest degree of statistical rigor desired.


3.3 Creating Valid Metadata

The simplest use of Impressions is to generate file- 
system images with realistic metadata. This process is 
performed in two phases: first, the skeletal file-system 
namespace is created; and second, the namespace is pop- 
ulated with files conforming to a number of file and di- 
rectory distributions. 


3.3.1 Creating File-System Namespace 

The first phase in creating a file system is to create the 
namespace structure or the directory tree. We assume 
that the user specifies the size of the file-system image. 
The count of files and directories is then selected based 
on the file system size (if not specified by the user). De- 
pending on the degree of detail desired by the user, each 
file or directory attribute is selected step by step until all 
attributes have been assigned values. We now describe 
this process assuming the highest degree of detail. 

To create directory trees, Impressions uses the gener- 
ative model proposed by Agrawal et al. [1] to perform a 
Monte Carlo simulation. According to this model, new 
directories are added to a file system one at a time, and 
the probability of choosing each extant directory as a parent
is proportional to C(d) + 2, where C(d) is the count of
extant subdirectories of directory d. The model explains 
the creation of the file system namespace, accounting 
both for the size and count of directories by depth, and 
the size of parent directories. The input to this model is 
the total number of directories in the file system. Direc- 
tory names are generated using a simple iterative counter. 
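
The generative step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the Impressions implementation: it assumes the root directory exists before any directory is added and is not counted in the requested total, the function names are hypothetical, and each directory's integer index stands in for its counter-based name.

```python
import random


def generate_directory_tree(num_dirs, seed=0):
    """Return parent[i] for directories 0..num_dirs; 0 is the root and has parent None."""
    rng = random.Random(seed)      # a fixed seed makes the tree reproducible
    parent = [None]                # directory 0 is the root (assumed to pre-exist)
    subdir_count = [0]             # C(d) for every extant directory d
    for _ in range(num_dirs):
        # Probability of picking d as the parent is proportional to C(d) + 2.
        weights = [c + 2 for c in subdir_count]
        chosen = rng.choices(range(len(parent)), weights=weights, k=1)[0]
        parent.append(chosen)
        subdir_count[chosen] += 1
        subdir_count.append(0)     # the new directory starts with no subdirectories
    return parent


def depths(parent):
    """Depth of every directory; parents always have smaller indices than children."""
    depth = [0] * len(parent)
    for d in range(1, len(parent)):
        depth[d] = depth[parent[d]] + 1
    return depth
```

Recomputing the weights on every step keeps the sketch short at the cost of quadratic running time; a real generator for large namespaces would likely maintain the weights incrementally.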

To ensure the accuracy of generated images, we com- 
pare the generated distributions (i.e., created using the 
parameters listed in Table 2), with the desired distribu- 
tions (i.e., ones obtained from the dataset discussed pre- 
viously in §3.1). Figure 2 shows in detail the accuracy 
for each step in the namespace and file creation process. 
For almost all the graphs, the y-axis represents the per- 
centage of files, directories, or bytes belonging to the cat- 
egories or bins shown on the x-axis, as the case may be. 

Figures 2(a) and 2(b) show the distribution of directo- 
ries by depth, and directories by subdirectory count, re- 
spectively. The y-axis in this case is the percentage of di- 
rectories at each level of depth in the namespace, shown 
on the x-axis. The two curves representing the generated 
and the desired distributions match quite well, indicating 
good accuracy and reaffirming prior results [1]. 


3.3.2 Creating Files 
The next phase is to populate the directory tree with files.
Impressions spends most of the total runtime and effort
during this phase, as the bulk of its statistical machinery
is exercised in creating files. Each file has a number of
attributes such as its size, depth in the directory tree, parent
directory, and file extension. Similarly, the choice of
the parent directory is governed by directory attributes
such as the count of contained subdirectories, the count
of contained files, and the depth of the parent directory.
Analytical approximations for file system distributions
proposed previously [12] guided our own models.


Figure 2: Accuracy of Impressions in recreating file system properties.
file system state for all parameters of interest shown here. We include a special abscissa for the zero value on graphs having a logarithmic scale.


First, for each file, the size of the file is sampled 
from a hybrid distribution describing file sizes. The 
body of this hybrid curve is approximated by a lognor- 
mal distribution, with a Pareto tail distribution (k=0.91, 
Xm=512MB) accounting for the heavy tail of files with 
size greater than 512 MB. The exact parameter values 
used for these distributions are listed in Table 2. These 
parameters were obtained by fitting the respective curves 
to file sizes obtained from the file-system dataset previ- 
ously discussed (§3.1). Figure 2(c) shows the accuracy 
of generating the distribution of files by size. We initially
used a simpler model for file sizes represented solely by
a lognormal distribution. While the results were accept- 
able for files by size (Figure 2(c)), the simpler model 
failed to account for the distribution of bytes by contain- 
ing file size; coming up with a model to accurately cap- 
ture the bimodal distribution of bytes proved harder than 
we had anticipated. Figure 2(d) shows the accuracy of 
the hybrid model in Impressions in generating the distri- 
bution of bytes. The pronounced double mode observed 
in the distribution of bytes is a result of the presence of 
a few large files; an important detail that is otherwise 
missed if the heavy-tail of file sizes is not accurately ac- 
counted for. 
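
As a rough illustration of the hybrid model, the Python sketch below draws file sizes from a lognormal body and a Pareto tail using the default values listed in Table 2. Several points are assumptions rather than statements about Impressions: the value 0.99994 is treated as the mixing weight of the body, the lognormal parameters are taken to be on a natural-log scale with sizes in bytes, 512 MB is read as 512 × 2^20 bytes, and the body is not truncated at 512 MB. The function names are hypothetical.

```python
import random

ALPHA_BODY = 0.99994        # assumed mixing weight of the lognormal body (Table 2)
MU, SIGMA = 9.48, 2.46      # lognormal parameters (Table 2); byte units are an assumption
PARETO_K = 0.91             # Pareto shape parameter k (Table 2)
PARETO_XM = 512 * 2**20     # Pareto scale Xm: 512 MB, read here as 512 * 2^20 bytes


def sample_body(rng):
    """Draw a size from the lognormal body."""
    return int(rng.lognormvariate(MU, SIGMA))


def sample_tail(rng):
    """Draw a size from the Pareto tail; paretovariate returns values >= 1."""
    return int(PARETO_XM * rng.paretovariate(PARETO_K))


def sample_file_size(rng):
    """Pick the body with probability ALPHA_BODY, otherwise the heavy tail."""
    if rng.random() < ALPHA_BODY:
        return sample_body(rng)
    return sample_tail(rng)
```

Because the tail is chosen only about 6 times in 100,000 draws, it barely affects the distribution of files by count, yet with a shape parameter below 1 those few files can hold a large share of the bytes, which is consistent with the second mode discussed above.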


Once the file size is selected, we assign the file name
and extension. Impressions keeps a list of percentile values
for popular file extensions (i.e., top 20 extensions by
count, and by bytes). These extensions together account
for roughly 50% of files and bytes in a file system, ensuring
adequate coverage for the important extensions. The
remainder of files are given randomly generated three-character
extensions. Currently filenames are generated
by a simple numeric counter incremented on each file
creation. Figure 2(e) shows the accuracy of Impressions
in creating files with popular extensions by count.


Parameter | MDCC
Directory count with depth | 0.03
Directory size (subdirectories) | 0.004
File size by count | 0.04
File size by containing bytes | 0.02
Extension popularity | 0.03
File count with depth | 0.05
Bytes with depth | 0.12 MB*
File count w/ depth w/ special dirs | 0.06


Table 3: Statistical accuracy of generated images.
Shows average accuracy of generated file-system images in terms of
the MDCC (Maximum Displacement of the Cumulative Curves) representing
the maximum difference between cumulative curves of generated
and desired distributions. Averages are shown for 20 trials. (*)
For bytes with depth, MDCC is not an appropriate metric; we instead
report the average difference in mean bytes per file (MB). The numbers
correspond to the set of graphs shown in Figure 2 and reflect fairly
accurate images.


Next, we assign file depth d, which requires satisfying 
two criteria: the distribution of files with depth, and the 
distribution of bytes with depth. The former is modeled 
by a Poisson distribution, and the latter is represented 
by the mean file sizes at a given depth. Impressions 
uses a multiplicative model combining the two criteria, 
to produce appropriate file depths. Figures 2(f) and 2(g) 
show the accuracy in generating the distribution of files 
by depth, and the distribution of bytes by depth, respec- 
tively. 

The final step is to select a parent directory for the 
file, located at depth d - 1, according to the distribution
of directories with file count, modeled using an inverse- 
polynomial of degree 2. As an added feature, Impres- 
sions supports the notion of “Special” directories con- 
taining a disproportionate number of files or bytes (e.g., 
“Program Files” folder in the Windows environment). If 
required, during the selection of the parent directory, a 
selection bias is given to these special directories. Fig- 
ure 2(h) shows the accuracy in supporting special direc- 
tories with an example of a typical Windows file system 
having files in the web cache at depth 7, in Windows 
and Program Files folders at depth 2, and System 
files at depth 3. 


Table 3 shows the average difference between the generated 
and desired images from Figure 2 for 20 trials. 
The difference is measured in terms of the MDCC (Maximum 
Displacement of the Cumulative Curves). For instance, an 
MDCC value of 0.03 for directories with depth implies a 
maximum difference of 3%, on average, between the desired 
and the generated cumulative distributions. Overall, we find 
that the models created and used by Impressions for 
representing various file-system parameters produce fairly 
accurate distributions in all the above cases. While we have 
demonstrated the accuracy of Impressions for the Windows 
dataset, there is no fundamental restriction limiting it to 
this dataset. We believe that with little effort, the same 
level of accuracy can be achieved for any other dataset. 
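
As an illustration of the metric, the following minimal Python sketch 
computes the MDCC of two binned distributions. The function name mdcc 
and the example fractions are hypothetical and are not taken from 
Impressions or from Table 3; the sketch only assumes that both 
distributions are given over the same bins. 

```python
from itertools import accumulate

def mdcc(desired, generated):
    """Maximum Displacement of the Cumulative Curves of two binned
    distributions given as per-bin fractions over the same bins."""
    if len(desired) != len(generated):
        raise ValueError("distributions must use the same bins")
    # Normalise so that each cumulative curve ends at 1.0.
    d_total, g_total = sum(desired), sum(generated)
    d_cum = accumulate(x / d_total for x in desired)
    g_cum = accumulate(x / g_total for x in generated)
    return max(abs(d - g) for d, g in zip(d_cum, g_cum))

# Hypothetical fractions of directories at depths 1 to 5.
desired = [0.10, 0.30, 0.40, 0.15, 0.05]
generated = [0.12, 0.31, 0.37, 0.15, 0.05]
print(round(mdcc(desired, generated), 3))
```

For these made-up inputs the largest gap between the two cumulative 
curves occurs at depth 2 (0.43 versus 0.40), so the MDCC is about 0.03. 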


3.4 Resolving Arbitrary Constraints 

One of the primary requirements for Impressions is to al- 
low flexibility in specifying file system parameters with- 
out compromising accuracy. This means that users are al- 
lowed to specify somewhat arbitrary constraints on these 
parameters, and it is the task of Impressions to resolve 
them. One example of such a set of constraints would be 
to specify a large number of files for a small file system, 
or vice versa, given a file size distribution. Impressions 
will try to come up with a sample of file sizes that best 
approximates the desired distribution, while still main- 
taining the invariants supplied by the user, namely the 
number of files in the file system and the sum of all file 
sizes being equal to the file system used space. 

Multiple constraints can also be implicit (i.e., arise 
even in the absence of user-specified distributions). Due 
to random sampling, different sample sets of the same 
distribution are not guaranteed to produce exactly the 
same result, and consequently, the sum of the elements 
can also differ across samples. Consider the previous ex- 
ample of file sizes again: the sum of all file sizes drawn 
from a given distribution need not add up to the desired 
file system size (total used space) each time. More for- 
mally, this example is represented by the following set of 
constraints: 


\[
\begin{aligned}
N &= \{\, \mathrm{Constant}_1 \lor x : x \in D_1(x) \,\} \\
S &= \{\, \mathrm{Constant}_2 \lor x : x \in D_2(x) \,\} \\
F &= \{\, x : x \in D_3(x; \mu, \sigma) \,\}; \qquad
     \Bigl|\, \sum_{i=0}^{N} F_i - S \,\Bigr| \le \beta \times S
\end{aligned}
\]

where $N$ is the number of files in the file system; $S$ is 
the desired file system used space; $F$ is the set of file 
sizes; and $\beta$ is the maximum relative error allowed. The 
first two constraints specify that $N$ and $S$ can be 
user-specified constants or sampled from their corresponding 
distributions $D_1$ and $D_2$. Similarly, $F$ is sampled from 
the file size distribution $D_3$. These attributes are further 
subject to the constraint that the sum of all file sizes 
differs from the desired file system size by no more than the 
allowed error tolerance, specified by the user. To solve 
this problem, we use the following two techniques: 


• If the initial sample does not produce a result satisfying 
all the constraints, we oversample additional values 
of $F$ from $D_3$, one at a time, until a solution is found or 
the oversampling factor $\alpha/N$ reaches $\lambda$ (the maximum 
oversampling factor). $\alpha$ is the count of extra samples 
drawn from $D_3$. Upon reaching $\lambda$ without finding a 
solution, we discard the current sample set and start over. 


Figure 3: Resolving Multiple Constraints. 
(a) Shows the process of convergence of a set of 1000 file sizes to the desired file 
system size of 90000 bytes. Each line represents an individual trial. A successful trial is one that converges to the 5% error line in less than 1000 
oversamples. (b) Shows the difference between the original distribution of files by size, and the constrained distribution after resolution of multiple 
constraints in (a). O: Original; C: Constrained. (c) Same as (b), but for distribution of files by bytes instead. 


Num.    Sum of file sizes   File size distribution   Avg. $\beta$   Avg. $\beta$   Avg. $\alpha$   Avg. D   Avg. D   Success 
files   S (bytes)           ($\mu$, $\sigma$)        Initial        Final                          Count    Bytes 
1000    30000               (8.16, 2.46)             21.55%         --             --              0.043    0.050    -- 
1000    60000               (8.16, 2.46)             20.01%         --             --              0.032    0.033    -- 
1000    90000               (8.16, 2.46)             34.35%         --             --              0.067    0.084    -- 

Table 4: Summary of resolving multiple constraints. 
Shows average rate and accuracy of convergence after resolving multiple 
constraints for different values of desired file system size. $\beta$: % error between the desired and generated sum; $\alpha$: % of oversamples required; D is 
the test statistic for the K-S test representing the maximum difference between generated and desired empirical cumulative distributions. Averages 
are for 20 trials. Success is the number of trials having final $\beta$ < 5%, and D passing the K-S test. 


• The number of elements in $F$ during the oversampling 
stage is $N + \alpha$. For every oversampling, we need to find 
if there exists $F_{sub}$, a subset of $F$ with $N$ elements, such 
that the sum of all elements of $F_{sub}$ (file sizes) differs 
from the desired file system size by no more than the 
allowed error. More formally stated, we find if: 

\[
\exists\, F_{sub} = \Bigl\{ X : X \subset \mathcal{P}(F),\ |X| = N,\ |F| = N + \alpha,\ 
\Bigl|\, \sum_{i=0}^{N} X_i - S \,\Bigr| \le \beta \times S,\ 
\alpha \in \mathbb{N} \wedge \frac{\alpha}{N} \le \lambda \Bigr\}
\]


The problem of resolving multiple constraints, as formulated 
above, is a variant of the more general “Subset 
Sum Problem”, which is NP-complete [8]. Our solution 
is thus an approximation algorithm based on an existing 
O(n log n) solution [37] for the Subset Sum Problem. 


The existing algorithm has two phases. The first phase 
randomly chooses a solution vector that is valid (the 
sum of its elements is less than the desired sum) and 
maximal (adding any element not already in the solution 
vector would cause the sum to exceed the desired sum). The 
second phase performs local improvement: for each element 
in the solution, it searches for the largest element 
not in the current solution that, if swapped with the 
current element, would reduce the difference between the 
desired and current sums. The solution vector is updated 
if such an element is found, and the algorithm proceeds 
with the next element until all elements have been compared. 

Our problem definition and the modified algorithm 
differ from the original in the following ways: 


• First, in the original problem, there is no restriction on 
the number of elements in the solution subset $F_{sub}$. In 
our case, $F_{sub}$ must have exactly $N$ elements. We modify 
the first phase of the algorithm to set the initial $F_{sub}$ 
as the first random permutation of $N$ elements selected 
from $F$ such that their sum is less than $S$. 

• Second, the original algorithm either finds a solution 
or terminates without success. We use an increasing 
sample size after each oversampling to reduce the error 
and allow the solution to converge. 

• Third, it is not sufficient for the elements in $F_{sub}$ to 
have a numerical sum close to the desired sum $S$; 
the distribution of the elements must also be close to the 
original distribution in $F$. A goodness-of-fit test at the 
end of each oversampling step enforces this requirement. 
For our example, this ensures that the set of file sizes 
generated after resolving multiple constraints still follows 
the original distribution of file sizes. 
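
The oversampling loop and the local-improvement phase described above 
can be sketched in Python as follows. This is a simplified 
illustration under stated assumptions, not the Impressions 
implementation: all function names are hypothetical, the initial 
subset is an arbitrary random selection rather than one whose sum is 
below the target, the swap candidate is the best-fitting spare 
element rather than the largest one, and the goodness-of-fit check is 
a large-sample two-sample K-S test with the usual 0.05-level constant 
1.358. The lognormal parameters in the usage lines are made up. 

```python
import math
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample K-S statistic D: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)

def ks_passes(a, b, c=1.358):
    """Large-sample two-sample K-S check; c = 1.358 is the 0.05-level constant."""
    n, m = len(a), len(b)
    return ks_statistic(a, b) <= c * math.sqrt((n + m) / (n * m))

def improve(chosen, spare, target):
    """Local improvement: swap a chosen element with a spare one whenever
    the swap moves the sum of `chosen` closer to `target`. Returns the sum."""
    total = sum(chosen)
    for i in range(len(chosen)):
        if not spare:
            break
        gap = lambda k: abs(total - chosen[i] + spare[k] - target)
        j = min(range(len(spare)), key=gap)
        if gap(j) < abs(total - target):
            total += spare[j] - chosen[i]
            chosen[i], spare[j] = spare[j], chosen[i]
    return total

def resolve(n, target, draw, beta=0.05, lam=1.0, rng=random):
    """Return (subset, alpha) with |sum(subset) - target| <= beta * target,
    or None once the oversampling factor alpha / n would exceed lam."""
    pool = [draw() for _ in range(n)]
    alpha = 0
    while alpha / n <= lam:
        order = pool[:]
        rng.shuffle(order)
        chosen, spare = order[:n], order[n:]
        total = improve(chosen, spare, target)
        if abs(total - target) <= beta * target and ks_passes(chosen, pool):
            return chosen, alpha
        pool.append(draw())          # oversample one more value and retry
        alpha += 1
    return None

def resolve_with_restarts(n, target, draw, restarts=10, **kwargs):
    """Discard the sample set and start over when one attempt fails."""
    for _ in range(restarts):
        result = resolve(n, target, draw, **kwargs)
        if result is not None:
            return result
    return None

rng = random.Random(7)
draw = lambda: rng.lognormvariate(4.0, 1.0)    # hypothetical D3
result = resolve_with_restarts(200, 18000.0, draw, rng=rng)
if result is not None:
    subset, alpha = result
    print(len(subset), alpha, round(sum(subset)))
```

The sketch keeps the three properties emphasised above: the subset 
always has exactly n elements, the pool grows by one sample per 
failed attempt, and a candidate is accepted only if it also passes 
the goodness-of-fit check against the full pool. 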


The algorithm terminates successfully when the difference 
between the sums, and between the distributions, 
falls below the desired error levels. The success of the 
algorithm depends on the choice of the desired sum and 
the expected sum (the sum due to the choice of parameters, 
e.g., $\mu$ and $\sigma$); the farther the desired sum is from 
the expected sum, the lower the chances of success. 

Consider an example where a user has specified a desired 
file system size of 90000 bytes, a lognormal file 
size distribution ($\mu$=8.16, $\sigma$=2.46), and 1000 files. 
Figure 3(a) shows the convergence of the sum of file sizes 
in a sample set obtained with this distribution. Each line 
in the graph represents an independent trial, starting at a 
y-axis value equal to the sum of its initially sampled file 
sizes. Note that in this example, the initial sum differs 
from the desired sum by more than 100% in several 
cases. The x-axis represents the number of extra iterations 
(oversamples) performed by the algorithm. For a 
trial to succeed, the sum of file sizes in the sample must 
converge to within 5% of the desired file system size. We 
find that in most cases $\lambda$ ranges between 0 and 0.1 (i.e., 
less than 10% oversampling), and in almost all cases 
$\lambda < 1$. 

The distribution of file sizes in $F_{sub}$ must be close 
to the original distribution in $F$. Figures 3(b) and 3(c) 
show the difference between the original and constrained 
distributions for file sizes (for files by size, and files 
by bytes), for one successful trial from Figure 3(a). 
We choose these particular distributions as examples 
throughout this paper for two reasons. First, file size is 
an important parameter, so we want to be particularly 
thorough in its accuracy. Second, getting an accurate 
shape for the bimodal curve of files by bytes presents 
a challenge for Impressions; once our techniques work 
for this curve, we are fairly confident of their accuracy 
on simpler distributions. 

We find that Impressions resolves multiple constraints 
to satisfy the requirement on the sum, while respecting 
the original distributions. Table 4 gives the summary for 
the above example of file sizes for different values of the 
desired file system size. The expected sum of 1000 file 
sizes, sampled as specified in the table, is close to 60000. 
Impressions successfully converges the initial sample set 
to the desired sum with an average oversampling rate $\alpha$ 
less than 5%. The average difference between the desired 
and achieved sum, $\beta$, is close to 3%. The constrained 
distribution passes the two-sample K-S test at the 0.05 
significance level, with the difference between the two 
distributions being fairly small (the D statistic of the K-S 
test, which represents the maximum difference between two 
empirical cumulative distributions, is around 0.03). 

We repeat the above experiment for two more choices 
of file system sizes, one lower than the expected mean 
(30K), and one higher (90K); we find that even when the 
desired sum is quite different from the expected sum, our 
algorithm performs well. The algorithm failed to converge 
in only 2 of the 20 trials in the 90K case. For these 
extreme cases, we drop the initial sample and start over. 


3.5 Interpolation and Extrapolation 

Impressions requires knowledge of the distribution of 
file system parameters necessary to create a valid image. 
While it is tempting to imagine that Impressions has 
perfect knowledge about the nature of these distributions 
for all possible values and combinations of individual 
parameters, such knowledge is often impossible to obtain. 


Figure 4: Piecewise Interpolation of File Sizes. Piecewise 
interpolation for the distribution of files with bytes, using file 
systems of 10 GB, 50 GB and 100 GB. Each power-of-two bin on the 
x-axis is treated as an individual segment for interpolation (inset). 
Final curve is the composite of all individual interpolated segments. 


Distribution          FS Region (I/E)   D Statistic   K-S Test (0.05) 
File sizes by count   75 GB (I)         --            passed 
File sizes by count   125 GB (E)        --            passed 
File sizes by bytes   75 GB (I)         --            passed 
File sizes by bytes   125 GB (E)        --            passed 

Table 5: Accuracy of interpolation and extrapolation. 
Impressions produces accurate curves for file systems of size 75 GB and 
125 GB, using interpolation (I) and extrapolation (E), respectively. 

First, the empirical data is limited to what is observed 
in any given dataset and may not cover the entire range 
of possible values for all parameters. Second, even with 
an exhaustive dataset, the user may want to explore re- 
gions of parameter values for which no data point exists, 
especially for “what if” style of analysis. Third, from an 
implementation perspective, it is more efficient to main- 
tain compact representations of distributions for a few 
sample points, instead of large sets of data. Finally, if 
the empirical data is statistically insignificant, especially 
for outlying regions, it may not serve as an accurate rep- 
resentation. Impressions thus provides the capability for 
interpolation and extrapolation from available data and 
distributions. 

Impressions needs to generate complete new curves 
from existing ones. To illustrate our procedure, we 
describe an example of creating an interpolated curve; 
extensions to extrapolation are straightforward. Figure 4 
shows how Impressions uses piecewise interpolation for 
the distribution of files by containing bytes. In this 
example, we start with the distribution of file sizes for file 
systems of size 10 GB, 50 GB and 100 GB, shown in the 
figure. Each power-of-two bin on the x-axis is treated 
as an individual segment, and the available data points 
within each segment are used as input for piecewise 
interpolation; the process is repeated for all segments of the 
curve. Impressions combines the individual interpolated 
segments to obtain the complete interpolated curve. 
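
The per-bin procedure can be illustrated with the following minimal 
Python sketch. It assumes linear interpolation within each segment 
and a final renormalisation of the composite curve; the text above 
does not state which interpolant Impressions uses, and the function 
names and the four-bin example curves are hypothetical. 

```python
def interpolate_bin(points, target):
    """Linearly interpolate (or extrapolate) one bin's value at `target`.

    `points` is a list of (fs_size_gb, fraction) pairs sorted by size
    and must contain at least two entries.
    """
    # Use the segment that brackets the target; when the target lies
    # outside the known range, the first or last segment is extended.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if target <= x1:
            break
    slope = (y1 - y0) / (x1 - x0)
    return max(0.0, y0 + slope * (target - x0))

def interpolate_curve(curves, target):
    """curves maps fs_size_gb -> list of per-bin fractions."""
    sizes = sorted(curves)
    bins = len(curves[sizes[0]])
    raw = [interpolate_bin([(s, curves[s][b]) for s in sizes], target)
           for b in range(bins)]
    total = sum(raw)
    return [v / total for v in raw]    # composite curve sums to 1

# Hypothetical distributions over four power-of-two bins.
curves = {
    10:  [0.40, 0.30, 0.20, 0.10],
    50:  [0.30, 0.30, 0.25, 0.15],
    100: [0.20, 0.30, 0.30, 0.20],
}
curve_75 = interpolate_curve(curves, 75)     # interpolation
curve_125 = interpolate_curve(curves, 125)   # extrapolation
print([round(v, 3) for v in curve_75])
print([round(v, 3) for v in curve_125])
```

Each bin is handled independently of its neighbours, which is why a 
renormalisation step is included before the segments are combined. 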

To demonstrate the accuracy of our approach, we 
interpolate and extrapolate file size distributions for file 
systems of sizes 75 GB and 125 GB, respectively. Figure 5 
shows the results of applying our technique, comparing 
the generated distributions with actual distributions 
for the file system sizes (we removed this data from 
the dataset used for interpolation). We find that the 
simpler curves such as Figure 5(a) and (c) are interpolated 
and extrapolated with good accuracy. Even for more 
challenging curves such as Figure 5(b) and (d), the 
results are accurate enough to be useful. Table 5 contains 
the results of conducting K-S tests to measure the 
goodness-of-fit of the generated curves. All the generated 
distributions passed the K-S test at the 0.05 significance 
level. 


Figure 5: Accuracy of Interpolation and Extrapolation. Shows results of applying piecewise interpolation to generate file size 
distributions (by count and by bytes), for file systems of size 75 GB (a and b, respectively), and 125 GB (c and d, respectively). 


3.6 File Content 


Actual file content can have a substantial impact on the 
performance of an application. For example, Postmark [24], 
one of the most popular file system benchmarks, 
tries to simulate an email workload, yet it pays 
scant attention to the organization of the file system, and 
is completely oblivious to the file data. Postmark fills 
all the “email” files with the same data, generated using 
the same random seed. The evaluation results can range 
from misleading to completely inaccurate, for instance 
in the case of content-addressable storage (CAS). When 
evaluating a CAS-based system, the disk-block traffic 
and the corresponding performance will depend only on 
the unique content, which in this case belongs to the largest 
file in the file system. Similarly, performance of Desktop 
Search and Word Processing applications is sensitive to 
file content. 

In order to generate representative file content, 
Impressions supports a number of options. For 
human-readable files such as .txt and .html files, it can populate 
file content with random permutations of symbols and 
words, or with more sophisticated word-popularity models. 
Impressions maintains a list of the relative popularity 
of the most popular words in the English language, and 
a Monte Carlo simulation generates words for file content 
according to this model. However, the distribution 
of word popularity is heavy-tailed; hence, maintaining 
an exhaustive list of words slows down content generation. 
To improve performance, we use a word-length 
frequency model [43] to generate the long tail of words, and 
use the word-popularity model for the body alone. The 
user has the flexibility to select either one of the models 
in its entirety, or a specific combination of the two. It 
is also relatively straightforward to add extensions in the 
future to generate more nuanced file content. An example 
of such an extension is one that carefully controls the 
degree of content similarity across files. 
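
A minimal Python sketch of such a hybrid generator is shown below. 
The word list, the popularity weights, the word-length frequencies, 
and the tail probability are all hypothetical placeholders; they are 
not the models shipped with Impressions. Tail words are filled with 
random lowercase letters purely for illustration. 

```python
import random
import string

# Hypothetical popularity weights for the body of the distribution.
POPULAR = {"the": 50, "of": 30, "and": 25, "to": 22, "a": 20}
# Hypothetical word-length frequencies for the long tail (length -> weight).
LENGTH_FREQ = {3: 20, 4: 25, 5: 22, 6: 15, 7: 10, 8: 8}

def next_word(rng, tail_prob=0.2):
    """Draw from the tail model with probability tail_prob, otherwise
    from the word-popularity model."""
    if rng.random() < tail_prob:
        length = rng.choices(list(LENGTH_FREQ),
                             weights=list(LENGTH_FREQ.values()))[0]
        return "".join(rng.choice(string.ascii_lowercase)
                       for _ in range(length))
    return rng.choices(list(POPULAR), weights=list(POPULAR.values()))[0]

def generate_text(n_bytes, rng, tail_prob=0.2):
    """Return exactly n_bytes characters of space-separated words."""
    words, size = [], 0
    while size <= n_bytes:
        word = next_word(rng, tail_prob)
        words.append(word)
        size += len(word) + 1      # +1 for the separating space
    return " ".join(words)[:n_bytes]

text = generate_text(1000, random.Random(1))
print(len(text))
```

Setting tail_prob to 0 or 1 selects one model in its entirety, which 
mirrors the choice offered to the user above. 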

In order to generate content for typed files, Impressions 
either contains enough information to generate 
valid file headers and footers itself, or calls into a 
third-party library or software such as Id3v2 [31] for mp3; 
GraphApp [18] for gif, jpeg and other image files; 
Mplayer [28] for mpeg and other video files; asciidoc 
for html; and ascii2pdf for PDF files. 


3.7 Disk Layout and Fragmentation 


To isolate the effects of file system content, Impressions 
can measure the degree of on-disk fragmentation and 
create file systems with a user-defined degree of 
fragmentation. The extent of fragmentation is measured in terms 
of layout score [44]. A layout score of 1 means all files 
in the file system are laid out optimally on disk (i.e., all 
blocks of any given file are laid out consecutively one 
after the other), while a layout score of 0 means that no 
two blocks of any file are adjacent to each other on disk. 
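
The two endpoints described above are consistent with computing the 
score as the fraction of blocks that immediately follow their 
predecessor within the same file. The Python sketch below assumes 
that reading of the metric; the function name and the block lists 
are hypothetical, and the first block of each file is not counted 
because it has no predecessor. 

```python
def layout_score(files):
    """files: iterable of lists of physical block numbers, each list in
    the logical order of one file's blocks."""
    contiguous = candidates = 0
    for blocks in files:
        for prev, cur in zip(blocks, blocks[1:]):
            candidates += 1
            contiguous += (cur == prev + 1)
    # A file system with no multi-block files is treated as optimal.
    return contiguous / candidates if candidates else 1.0

# One fully contiguous file and one file with a single gap.
print(layout_score([[10, 11, 12, 13], [20, 22, 23]]))
```

In this example four of the five block transitions are contiguous, 
giving a score of 0.8. 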

Impressions achieves the desired degree of fragmentation 
by issuing pairs of temporary file create and delete 
operations during creation of regular files. When 
experimenting with a file-system image, Impressions gives 
the user complete control to specify the overall layout 
score. In order to determine the on-disk layout of files, 
we rely on the information provided by debugfs. Thus, 
we currently support layout measurement only for Ext2 
and Ext3. In future work, we will consider several 
alternatives for retrieving file layout information across a 
wider range of file systems. On Linux, the FIBMAP 
and FIEMAP ioctl()s are available to map a logical 
block to a physical block [23]. Other file system-specific 
methods exist, such as the XFS_IOC_GETBMAP ioctl 
for XFS. 

The previous approach, however, does not account for 
differences in fragmentation strategies across file systems. 
Impressions supports an alternate specification 
for the degree of fragmentation wherein it runs a 
pre-specified workload and reports the resulting layout score. 
Thus, if a file system employs better strategies to avoid 
fragmentation, it is reflected in the final layout score 
after running the fragmentation workload. 


                               Time taken (seconds) 
FS distribution (Default)      Image1      Image2 
Directory structure            --          1.26 
File sizes distribution        --          0.28 
Popular extensions             --          0.13 
File with depth                --          0.29 
File and bytes with depth      --          0.70 
File content (Single-word)     --          1.44 
On-disk file/dir creation      --          1394.84 
Total time                     473.20      1826.12 
File content (Hybrid model)    791.20      -- 

Table 6: Performance of Impressions. Shows time taken 
to create file-system images with break down for individual features. 
Image1: 4.55 GB, 20000 files, 4000 dirs. Image2: 12.0 GB, 52000 
files, 4000 dirs. Other parameters are default. The two entries for 
additional parameters are shown only for Image1 and represent times 
in addition to default times. 

There are several alternate techniques for inducing 
more realistic fragmentation in file systems. Factors such 
as burstiness of I/O traffic, out-of-order writes and inter- 
file layout are currently not accounted for; a companion 
tool to Impressions for carefully creating fragmented file 
systems will thus be a good candidate for future research. 





3.8 Performance 

In building Impressions, our primary objective was to 
generate realistic file-system images, giving top priority 
to accuracy, instead of performance. Nonetheless, Im- 
pressions does perform reasonably well. Table 6 shows 
the breakdown of time taken to create a default file- 
system image of 4.55 GB. We also show time taken for 
some additional features such as using better file content, 
and creating a fragmented file system. Overall, we find 
that Impressions creates highly accurate file-system im- 
ages in a reasonable amount of time and thus is useful in 
practice. 


4 Case Study: Desktop Search 


In this section, we use Impressions to evaluate desktop 
searching applications. Our goals for this case study are 
two-fold. First, we show how simple it is to use Impres- 
sions to create either representative images or images 
across which a single parameter is varied. Second, we 
show how future evaluations should report the settings 
of Impressions so that results can be easily reproduced. 
We choose desktop search for our case study because 
its performance and storage requirements depend not 
only on the file system size and structure, but also on the 
type of files and the actual content within the files. We 
evaluate two desktop search applications: open-source 
Beagle [5] and Google’s Desktop for Linux (GDL) [16]. 
Beagle supports a large number of file types using 52 
search-filters; it provides several indexing options, trading 
performance and index size against the quality and 
feature-richness of the index. Google Desktop does not 
provide as many options: a web interface allows users to 
select or exclude types of files and folder locations for 
searching, but does not provide any control over the type 
and quality of indexing. 


4.1 Representative Images 


Developers of data-intensive applications frequently 
need to make assumptions about the properties of file- 
system images. For example, file systems and applica- 
tions can often be optimized if they know properties such 
as the relative proportion of meta-data to data in repre- 
sentative file systems. Previously, developers could infer 
these numbers from published papers [1, 12, 41, 42], but 
only with considerable effort. With Impressions, devel- 
opers can simply create a sample of representative im- 
ages and directly measure the properties of interest. 

Figure 6 lists assumptions we found in GDL and Beagle 
limiting the search indexing to partial regions of the file 
system. However, for the representative file systems in 
our data set, these assumptions omit large portions of the 
file system. For example, GDL limits its index to only 
those files less than ten directories deep; our analysis of 
typical file systems indicates that this restriction causes 
10% of all files to be missed. We believe that instead of 
arbitrarily specifying hard values, application designers 
should experiment with Impressions to find acceptable 
choices. 
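
To make this kind of measurement concrete, the following Python 
sketch walks a directory tree and reports the fraction of files and 
bytes that a depth cutoff would exclude. It is a hypothetical helper, 
not part of Impressions or of either search tool; depth is assumed to 
be the number of directories between the scanned root and the file, 
so files directly in the root have depth 0. 

```python
import os

def scan(root):
    """Yield (depth, size_in_bytes) for every regular file under root."""
    root = os.path.abspath(root)
    for dirpath, _dirs, names in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield depth, os.path.getsize(path)

def missed(records, cutoff):
    """Fraction of files and fraction of bytes at depth >= cutoff."""
    records = list(records)
    n_files = len(records)
    n_bytes = sum(size for _, size in records)
    deep = [(d, s) for d, s in records if d >= cutoff]
    frac_files = len(deep) / n_files if n_files else 0.0
    frac_bytes = sum(s for _, s in deep) / n_bytes if n_bytes else 0.0
    return frac_files, frac_bytes

# Hypothetical (depth, size) records; a real run would use scan(path).
records = [(1, 100), (2, 100), (12, 300), (3, 500)]
print(missed(records, cutoff=10))
```

Running missed(scan(path), cutoff) on a generated image gives the 
share of content that an indexer with that depth limit would skip. 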

We note that Impressions is useful for discovering 
these application assumptions and for isolating performance 
anomalies that depend on the file-system image. 
Isolating the impact of different file-system features is 
easy using Impressions: evaluators can use Impressions 
to create file-system images in which only a single 
parameter is varied, while all other characteristics are 
carefully controlled. 

This type of discovery is clearly useful when one is 
using closed-source code, such as GDL. For example, 
we discovered the GDL limitations by constructing 
file-system images across which a single parameter is varied 
(e.g., file depth and file size), measuring the percentage 
of indexed files, and noticing precipitous drops in 
this percentage. This type of controlled experimentation 
is also useful for finding non-obvious performance 
interactions in open-source code. For instance, Beagle 
uses the inotify mechanism [22] to track each directory 
for change; since the default Linux kernel provides 8192 
watches, Beagle resorts to manually crawling the directories 
once their count exceeds 8192. This deterioration in 
performance can be easily found by creating file-system 
images with varying numbers of directories. 


Parameter & Value            Comment on Validity 
File content < 10 deep       10% of files and 5% of bytes > 10 deep 
                             (content in deeper namespace is growing) 
Text file sizes < 200 KB     13% of files and 90% of bytes > 200 KB 
Text file cutoff < 5 MB      0.13% of files and 71% of bytes > 5 MB 
Archive files < 10 MB        4% of files and 84% of bytes > 10 MB 
Shell scripts < 20 KB        20% of files and 89% of bytes > 20 KB 

Figure 6: Debunking Application Assumptions. 
Examples of assumptions made by Beagle and GDL, along with details of the 
amount of file-system content that is not indexed as a consequence. 


Figure 7: Impact of file content. Compares 
Beagle and GDL index time and space 
for wordmodels and binary files. Google has a 
smaller index for wordmodels, but larger for binary. 
Uses Impressions default settings, with FS 
size 4.55 GB, 20000 files, 4000 dirs. 


4.2 Reproducible Images 


The time spent by desktop search applications to crawl 
a file-system image is significant (i.e., hours to days); 
therefore, it is likely that different developers will 
innovate in this area. In order for developers to be able to 
compare their results, they must be able to ensure they 
are using the same file-system images. Impressions 
allows one to precisely control the image and report the 
parameters so that the exact same image can be reproduced. 

For desktop search, the type of files (i.e., their 
extensions) and the content of files have a significant impact on 
the time to build the index and its size. We imagine a 
scenario in which the Beagle and GDL developers wish 
to compare index sizes. To make a meaningful comparison, 
the developers must clearly specify the file-system 
image used; this can be done easily with Impressions by 
reporting the size of the image, the distributions listed 
in Table 2, the word model, disk layout, and the random 
seed. We anticipate that most benchmarking will be done 
using mostly default values, reducing the number of 
Impressions parameters that must be specified. 
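
The role of the reported parameters, and of the random seed in 
particular, can be illustrated with a short Python sketch. The 
ImageSpec record, its field names, and the seed value are 
hypothetical and do not correspond to the actual Impressions 
interface; the image dimensions are those of the default image used 
in this section, and the lognormal parameters are those of the 
earlier file-size example. 

```python
import json
import random
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ImageSpec:
    """Hypothetical record of the settings needed to rebuild an image."""
    fs_size_bytes: int
    num_files: int
    num_dirs: int
    word_model: str
    layout_score: float
    seed: int

def sample_file_sizes(spec, mu=8.16, sigma=2.46):
    """Draw file sizes from a lognormal distribution using only the
    reported seed, so any holder of the spec gets the same sample."""
    rng = random.Random(spec.seed)
    return [rng.lognormvariate(mu, sigma) for _ in range(spec.num_files)]

spec = ImageSpec(fs_size_bytes=int(4.55 * 2**30), num_files=20000,
                 num_dirs=4000, word_model="hybrid", layout_score=1.0,
                 seed=42)

# What one group would publish, and what another would reconstruct.
report = json.dumps(asdict(spec), sort_keys=True)
rebuilt = ImageSpec(**json.loads(report))
print(sample_file_sizes(spec) == sample_file_sizes(rebuilt))
```

Because every random choice is derived from the reported seed, the 
reconstructed specification yields the identical sample. 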

An example of the reporting needed for reproducible 
results is shown in Figure 7. In these experiments, all 
distributions of the file system are kept constant, but only 
either text files (containing either a single word or with the 
default word model) or binary files are created. These 
experiments illustrate the point that file content 
significantly affects the index size; if two systems are 
compared using different file content, the results 
are meaningless. Specifically, different file types change 
even the relative ordering of index size between Beagle 
and GDL: given text files, Beagle creates a larger index; 
given binary files, GDL creates a larger index. 


Figure 8: Reproducible images: impact of content. Using Impressions to make 
results reproducible for benchmarking search. Vertical bars represent file systems created with 
file content as labeled. The Default file system is created using Impressions default settings, and 
file system size 4.55 GB, 20000 files, 4000 dirs. Index options: Original: default Beagle index. 
TextCache: build text-cache of documents used for snippets. DisDir: don’t add directories to 
the index. DisFilter: disable all filtering of files, only index attributes. 


Figure 8 gives an additional example of reporting 
Impressions parameters to make results reproducible. In 
these experiments, we discuss a scenario in which different 
developers have optimized Beagle and wish to 
meaningfully compare their results. In this scenario, the 
original Beagle developers reported results for four different 
images: the default, one with only text files, one with 
only image files, and one with only binary files. Other 
developers later create variants of Beagle: TextCache to 
display a small portion of every file alongside a search 
hit, DisDir to disable directory indexing, and DisFilter 
to index only attributes. Given the reported Impressions 
parameters, the variants of Beagle can be meaningfully 
compared to one another. 


In summary, Impressions makes it extremely easy to 
create both controlled and representative file-system images. 
Through this brief case study evaluating desktop 
search applications, we have shown some of the advantages 
of using Impressions. First, Impressions enables 
developers to tune their systems to the file system 
characteristics likely to be found in their target user 
populations. Second, it enables developers to easily create 
images where one parameter is varied and all others are 
carefully controlled; this allows one to assess the impact 
of a single parameter. Finally, Impressions enables 
different developers to ensure they are all comparing the 
same image; by reporting Impressions parameters, one 
can ensure that benchmarking results are reproducible. 


5 Related Work 


We discuss previous research in four areas related to file 
system benchmarking and usage of file system metadata. 

First, Impressions enables file system measurement 
studies to be put into practice. Besides the metadata 
studies on Windows workstations [1, 12], previous work 
in non-Windows environments includes Satyanarayanan’s 
study of a Digital PDP-10 [41], Irlam’s and Mullender’s 
studies of Unix systems [21, 29], and the study of HP-UX 
systems at Hewlett-Packard [42]. These studies provide 
valuable data for designers of file systems and related 
software, and can be easily incorporated in Impressions. 

Second, several models have been proposed to explain 
observed file-system phenomena. Mitzenmacher 
proposed a generative model, called the Recursive Forest 
File model [27], to explain the behavior of file size 
distributions. The model accounts for the hybrid distribution 
of file sizes with a lognormal body and Pareto tail. 
Downey’s Multiplicative File Size model [13] is based on 
the assumption that new files are created by using older 
files as templates (e.g., by copying, editing or filtering an 
old file). The size of the new file in this model is given by 
the size of the old file multiplied by an independent factor. 
These models provide an intuitive understanding of 
the underlying phenomena, and are also easier to simulate 
on a computer. In the future, Impressions can be enhanced 
by incorporating more such models. 
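
As a small illustration of the multiplicative idea, the Python 
sketch below grows a set of file sizes by repeatedly picking an 
existing file as a template and scaling its size by an independent 
random factor. The lognormal factor, the seed size, and the function 
name are assumptions made for this example only; they are not taken 
from [13]. 

```python
import random

def multiplicative_sizes(n, seed_size=1024, sigma=1.0, rng=random):
    """Simulate n file sizes where each new file is derived from a
    randomly chosen existing file."""
    sizes = [seed_size]
    while len(sizes) < n:
        template = rng.choice(sizes)
        factor = rng.lognormvariate(0.0, sigma)   # independent of template
        sizes.append(max(1, round(template * factor)))
    return sizes

sizes = multiplicative_sizes(1000, rng=random.Random(3))
print(len(sizes), min(sizes), max(sizes))
```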

Third, a number of tools and techniques have been 
proposed to improve the state of the art of benchmarking. 
Chen and Patterson proposed a “self-scaling” benchmark 
that scales with the I/O system being evaluated, to 
stress the system in meaningful ways [6]. TBBT is an 
NFS trace replay tool that derives the file-system image 
underlying a trace [50]. It extracts the file system 
hierarchy from a given trace in depth-first order and uses 
that during initialization for a subsequent trace replay. 
While this ensures a consistent file-system image for 
replay, it does not solve the more general problem of 
creating accurately controlled images for all types of file 
system benchmarking. The Auto-Pilot tool [48] provides 
an infrastructure for running tests and analysis tools to 
automate the benchmarking process. 

Finally, workload is an important piece of the benchmarking
puzzle. The SynRGen file reference generator
by Ebling and Satyanarayanan [14] generates synthetic
equivalents for real file system users. The volumes or
images in their work make use of simplistic assumptions 
about the file system distributions as their focus is on user 
access patterns. Roselli et al. collected dynamic file system
usage patterns in UNIX and Windows NT environments
and studied file system access behavior [39]. Recent
work on file system workloads includes a study of
network file system usage at NetApp [25].


6 Conclusion 


File system benchmarking is in a state of disarray. One 
key aspect of this problem is generating realistic file- 
system state, with due emphasis given to file-system 
metadata and file content. To address this problem, we 
develop Impressions, a statistical framework to generate 
realistic and configurable file-system images. Impres- 
sions provides the user flexibility in selecting a compre- 
hensive set of file system parameters, while seamlessly 
ensuring accuracy of the underlying images, serving as a 
useful platform for benchmarking. 

In our experience, we find Impressions easy to use 
and well suited for a number of tasks. It enables ap- 
plication developers to tune their systems to the file 
system characteristics likely found in their target users. 
Impressions also makes it feasible to compare perfor- 
mance of systems by standardizing and reporting all 
used parameters, a requirement necessary for bench- 
marking. We believe Impressions will prove to be a 
valuable tool for system developers and users alike; we 
intend to release it for public use in the near future. 
Please check http://www.cs.wisc.edu/adsl/Software/Impressions/
to obtain a copy.
Capture, conversion, and analysis of an intense NFS workload


Eric Anderson, HP Labs <eric.anderson4@hp.com>


Abstract 


We describe methods to capture, convert, store and analyze
NFS workloads that are 20-100x more intense, in
terms of operations/day, than any previously published. 
We describe three techniques that improve capture per- 
formance by up to 10x over previous techniques. For 
conversion, we use a general-purpose format that is both 
highly space efficient and provides efficient access to the 
trace data. For analysis, we describe a number of tech- 
niques adopted from the database community and some 
new techniques that facilitate analysis of very large traces. 
We also describe a number of guidelines for trace collec- 
tion that should prove useful to future practitioners. Fi- 
nally, we analyze a commercial feature animation (movie) 
rendering workload using these techniques and discuss 
the characteristics of the workload. Our implementation 
of these techniques is available as open source and the ex- 
act anonymized datasets we analyze are available for free 
download. 


1 Introduction 


Storage tracing and analysis have a long history. Some of 
the earliest filesystem traces were captured in 1985 [26], 
and there has been intermittent tracing effort since then, 
summarized by Leung [20]. Storage traces are analyzed 
to find properties that future systems should support or 
exploit, and as input to simulators and replay tools to ex- 
plore system performance with real workloads. 

One of the problems with trace analysis is that old 
traces inherently have to be scaled up to be used for eval- 
uating newer storage systems because the underlying per- 
formance of the newer systems has increased. There- 
fore the community benefits from regularly capturing new 
traces from multiple sources, and, if possible, traces that 
put a heavy load on the storage system, reducing the need 
to scale the workload. 

Most traces, since they are captured by academics, 
are captured in academic settings. This means that the 
workloads captured are somewhat comparable, but it also 
means that commercial workloads are under-represented. 
Microsoft is working to correct this by capturing commercial
enterprise traces from their internal servers [23]. Our
work focuses on commercial NFS [25, 6, 28] workloads,
in particular from a feature animation (movie) company,
whose name remains blinded as part of the agreement to
publish the traces. The most recent publicly available
NFS traces that we are aware of were collected in 2003 
by Ellard [13]. Our 2003 and 2007 traces [4] provide re- 
cent NFS traces for use by the community. 


One difference between our traces and other ones is the 
data rates that we measured. Our 2003 client traces saw 
about 750 million operations per day. In comparison, the 
2003 Ellard traces saw a peak of about 125 million NFS 
operations per day, and the 2007 Leung traces [20] saw a 
peak of 19 million CIFS operations/day. Our 2007 traces 
saw about 2.4 billion operations/day. This difference re- 
quired us to develop and adopt new techniques to capture, 
convert, and analyze the traces. 

Since our traces were captured in such a different en- 
vironment than prior traces, we limit our comparisons 
to their workloads, and we do not attempt to make any 
claims about trends. We believe that unless we, as a com- 
munity, collect traces from hundreds of different sites, we 
will not have sufficient data to make claims stronger than 
“this workload is different from other ones in these ways.” 
In fact, we make limited comparison of the trends be- 
tween our 2003 and 2007 traces for similar reasons. The 
underlying workload changed as the rendering techniques 
improved to generate higher quality output, the operating 
system generating the requests changed, the NFS proto- 
col version changed, and the configuration of the clients 
changed because of standard technology trends. 

The process of understanding a workload involves four 
main steps, as shown in Figure 1. Our tools for these 
steps are shown in italics for each step, as well as some 
traditional tools. The first step is capturing the workload, 
usually as some type of trace. The second step is con- 
version, usually from some raw format into a format de- 
signed for analysis. The third step is analysis to reduce the 
huge amount of converted data to something manageable. 
Alternately, this step is a simulation or replay to explore 
some new system architecture. Finally the fourth step is 
to generate graphs or textual reports from the output of 
the analysis or simulation. 
[Figure 1 diagram: Capture -> Conversion (nettrace2ds, tcpdump, tshark) -> Analysis/Simulation (ipdsanalysis, nfsdsanalysis, scripts, programs) -> Graphing/Reporting (mercury-plot, gnuplot, matlab, R)]

Figure 1: Overall process; our tools are shown in italics,
traditional tools after them.


Our work has five main contributions: 


1. The development of techniques for lossless raw
packet capture up to 5Gb/s, and with recent hardware
improvements, likely to 10Gb/s. These techniques 
are applicable to anyone wanting to capture a net- 
work storage service such as NFS, CIFS, or iSCSI. 


2. A series of guidelines for the conversion and storage 
of the traces. Many of these guidelines are things that 
we wish we had known when we were converting our 
traces. We used DataSeries [2] to store the traces, but 
our guidelines are general. 


3. Improved techniques for analyzing very large traces 
that allow us to look at the burstiness in workloads, 
and an examination of how the long averaging inter- 
vals in prior analysis can obscure workload proper- 
ties. 


4. The analysis of an intense NFS workload demon- 
strating that our techniques are successful. 


5. The agreement with the animation company to allow 
the roughly 100 billion operation anonymized traces 
to be published, along with the complete set of tools 
to perform all the analysis presented in this paper and 
to generate the graphs. Other researchers can build 
on our tools for further analysis, and use the traces in 
simulation studies. 


We examine related work in Section 2. We describe our 
capture techniques in Section 3, followed by the conver- 
sion in Section 4. We describe our adopted and new anal- 
ysis techniques in Section 5 and use them to analyze the 
workload in Section 6. Finally we conclude in Section 7. 
2 Related work


The two closest pieces of related work are Ellard’s NFS 
study [12, 13], and Leung’s 2007 CIFS study [20]. These 
papers also summarize the earlier decade of filesystem 
tracing, so we refer interested readers to those papers. 
Ellard et al. captured NFS traces from a number of Digital 
UNIX and NetApp servers on the Harvard campus, ana- 
lyzed the traces and presented new results looking at the 
sequentiality of the workload, and comparing his results 
to earlier traces. Ellard made his tools available, so we 
initially considered building on top of them, but quickly 
discovered that our workload was so much more intense 
that his tools would be insufficient, and so ended up building
our own. We later translated those tools and traces into
DataSeries, and found our version was about 100x faster
on a four core machine and used 25x less CPU time for
analysis. Our 2003 traces were about 25x more intense
than Ellard’s 2001 traces, and about 6x more intense than
Ellard’s 2003 traces.

Leung et al. traced a pair of NetApp servers on their 
campus. Since the clients were entirely running the Win- 
dows operating system, his traces were of CIFS data, and 
so he used the Wireshark tools [31] to convert the traces. 
Leung’s traces were of comparable intensity to Ellard’s 
traces, and they noted that they had some small packet 
drops during high load as they just used tcpdump for cap- 
ture. Leung identified and extensively analyzed compli- 
cated sequentiality patterns. Our 2007 traces were about 
95x more intense than Leung’s traces, as they saw a peak
of 19.1 million operations/day and we saw an average of 
about 1.8 billion. This comparison is slightly misleading 
as NFS tends to have more operations than CIFS because 
NFS is a stateless protocol. 

Tcpdump [30] is the tool that almost all researchers 
describe using to capture packet traces. We tried using 
tcpdump, but experienced massive packet loss using it in 
2003, and so developed new techniques. For compatibil- 
ity, we used the pcap file format, originally developed for 
tcpdump, for our raw captured data. When we captured 
our second set of traces in 2007, we needed to capture at 
even higher rates, and we used a specialized capture card. 
We wrote new capture software using techniques we had 
developed in 2003 to allow us to capture above 5Gb/s. 

Tcpdump also includes limited support for conversion 
of NFS packets. Wireshark [31] provides a graphical in- 
terface to packet analysis, and the tshark variant provides 
conversion to text. We were not aware of Wireshark at the 
time of our first capture, and we simply adjusted our ear- 
lier tools when we did our 2007 tracing. We may consider 
using the Wireshark converter in the future, provided we 
can make it run much faster. Running tshark on a small 2 
million packet capture took about 45 seconds whereas our
converter ran in about 5 seconds. Given conversion takes
2-3 days for a 5 day trace, we cannot afford conversion
to slow down by a factor of 9.

Some of the analysis techniques we use are derived 
from the database community, namely the work on 
cubes [16] and approximate quantiles [22]. We consid- 
ered using a standard SQL database for our storage and 
analysis, but abandoned that quickly because a database 
that can hold 100 billion rows is very expensive. We do 
use SQL databases for analysis and graphing once we 
have reduced the data size down to a few million rows 
using our tools. 


3 Raw packet capture 


The first stage in analyzing an NFS workload is capturing 
the data. There are three places that the workload could 
be captured: the client, the server, or the network. Cap- 
turing the workload on the clients is very parallel, but is 
difficult to configure and can interfere with the real work- 
load. Capturing the workload on the server is straightfor- 
ward if the server supports capture, but impacts the per- 
formance of the server. Capturing the workload on the 
network through port mirroring is almost as convenient as 
capture on the server, and given that most switches imple- 
ment mirroring in hardware, has no impact on network or 
workload performance. Therefore, we have always cho- 
sen to capture the data through the use of port mirroring, if 
necessary, using multiple Ethernet ports for the mirrored 
packets. 

The main challenge for raw packet capture is the under- 
lying data rate. In order to parse NFS packets, we have to 
capture the complete packet. Because the capture host is 
not interacting with clients, it has no way to throttle in- 
coming packets, so it needs to be able to capture at the 
full sustained rate or risk packet loss. To maximize flexi- 
bility, we want to write the data out to disk so that we can 
simplify the parsing and improve the error checking. This 
means that all of the incoming data eventually turns into
disk writes, leading to the second challenge of maximizing
effective disk space. 

While the 1 second average rates may be low enough 
to fit onto the mirror ports, if the switch has insufficient 
buffering, packets can still be dropped. We discovered 
this problem on a switch that used per-port rather than per- 
card buffering. To eliminate the problem, we switched 
to 10Gbit mirror ports to reduce the need for switch-side 
buffering. 

The capture host can also be overrun. At low data rates
(900Mb/s, 70,000 packets/s), standard tcpdump on commodity
hardware works fine. However, at high data rates
(5Gb/s, $10^6$ packets/s), traditional approaches are insufficient.
Indeed, Leung [20] notes difficulties with packet
loss using tcpdump on a 1Gbit mirror port. We have developed
three separate techniques for packet capture, all
of which work better than tcpdump: lindump (user-kernel
ring buffer), driverdump (in-kernel capture to files), and
endacedump (hardware capture to memory).


3.1 Lindump 


The Linux kernel includes a memory-mapped, shared ring 
buffer for packet capture. We modified the example lin- 
dump program to write out pcap files [8], the standard 
output format from tcpdump, and to be able to capture 
from more than one interface at the same time. We wrote 
the output files to an in-memory filesystem using mmap to 
reduce copies, and copied and compressed the files in parallel
to disk. Using an HP DL580G2, a current 4 socket
server circa 2003, lindump was able to capture about 3x
as many packets per second (pps) as tcpdump and about 1.25x
the bandwidth. Combined with a somewhat higher burst
rate while the kernel and network card buffered data, this 
approach was sufficient for mostly loss free captures at 
the animation company, and was the technique we used 
for all of the 2003 set of traces. 

Packets are captured into files in tmpfs, an in-memory
filesystem, and then compressed to maximize the effective
disk space. If the capture host was mostly idle, we compressed
with gzip -9. As the backlog of pending files
increased, we reduced the compression level to gzip
-6, then to gzip -1, and finally to no compression. In practice
this approach increased the effective disk size by 1.5-2.5x,
as the data was somewhat compressible,
but at higher input rates we had to fall back to reduced
compression.
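
The adaptive compression policy described above can be sketched as follows. This is a minimal illustration, not the code from our source distribution: the text gives the sequence of levels (gzip -9, -6, -1, then none) but not the backlog thresholds, so the thresholds in LEVELS and the function names are hypothetical.

```python
import gzip

# Hypothetical thresholds: (maximum backlog of pending files, gzip level).
# Only the order of levels 9 -> 6 -> 1 -> none comes from the text.
LEVELS = [(2, 9), (8, 6), (32, 1)]

def pick_level(backlog):
    """Return the gzip level for the current backlog, or None for no compression."""
    for max_backlog, level in LEVELS:
        if backlog <= max_backlog:
            return level
    return None

def compress_chunk(data, backlog):
    """Compress one captured file, trading compression ratio for speed under load."""
    level = pick_level(backlog)
    if level is None:
        return data  # too far behind: write the file out uncompressed
    return gzip.compress(data, compresslevel=level)
```

Keying the level on the backlog means the capture host spends spare CPU on compression when idle and sheds that work first when packets arrive faster than files can be drained.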


3.2 Driverdump


At another site, our 1Gbit lindump approach was insufficient
because of packet bursts and limited buffering on the
switch. Replacing the dual 1Gbit cards with a 10Gb/s card
merely moved the bottleneck to the host and the packets 
were dropped on the NIC before they could be consumed 
by the kernel. 

To fix this problem, we modified the network driver so 
that instead of passing packets up the network stack, it 
would just copy the packets in pcap format to a file, and 
immediately return the packet buffer to the NIC. A user 
space program prepared files for capture, and closed the 
files on completion. We called our solution driverdump 
since it performed all of the packet dumping in the driver. 
Because driverdump avoids the kernel IP stack, it can
capture packets faster than the IP stack could drop them.
We increased the sustained packets per second over lindump
by 2.25x to 676,000pps, and sustained bandwidth
by 1.5x to 170MiB/s (note 1 MiB/s = $2^{20}$ bytes/s). We
could handle short bursts up to 900,000 pps, and 215
MiB/s. This gave us nearly lossless capture to memory
at the second site. Since the files were written into tmpfs, 
we re-used our technology for compressing and copying 
the files out to disk. 


3.3 Endacedump 


In 2007, we returned to the animation company to collect 
new traces on their faster NFS servers and 10Gb/s net- 
work. While an update of driverdump might have been 
sufficient, we decided to also try the Endace DAG 8.2X 
capture card [14]. This card copies and timestamps pack- 
ets from a 10Gb/s network directly into memory. As a 
result, it can capture minimal size packets at full band- 
width, and is intended for doing in-memory analysis of 
networks. Our challenge was to get the capture out to 
disk, which was not believed to be feasible by our techni- 
cal contacts at Endace. 

To solve this problem, we integrated our adaptive
compression technique into a specialized capture program,
and added the lzf [21] compression algorithm, which
compresses at about 100MiB/s. We also upgraded our
hardware to an HP DL585g2 with 4 dual-core 2.8GHz
Opterons, and six 14-disk SCSI trays. Our compression
techniques turned our 20TiB of disk space into 30TiB of 
effective disk space. We experienced a very small number 
of packet drops because our capture card limited a single 
stream to PCI-X bandwidth (8Gbps), and required parti- 
tioning into two streams to capture 10Gb/s. Newer cards 
capture 10Gb/s in a single stream. 


3.4 Discussion 


Our capture techniques are directly applicable to anyone 
attempting to capture data from a networked storage ser- 
vice such as NFS, CIFS, or iSCSI. The techniques present 
a tradeoff. The simplest technique (lindump) is a drop-in
replacement for using tcpdump for full packet capture,
and combined with our adaptive compression algorithm
allows capture at over twice the rate of native
tcpdump and expands the effective size of the disks by
1.5x. The intermediate technique increases the capture
rates by an additional factor of 2-3x, but requires modification
of the in-kernel network driver. Our most advanced
technique is capable of lossless full-packet capture at
10Gb/s, but requires purchasing special capture hardware.
Both the lindump and driverdump code are available in
our source distribution [9]. These tools and techniques 
should eliminate problems of packet drops for capturing 
storage traces. Further details and experiments with the 
first two techniques can be found in [1]. 


4 Conversion from raw format 


Once the data is captured, the second problem is parsing 
and converting that data to an easily usable format. The
raw packet format contains a large amount of unnecessary 
data, and would require repeated, expensive parsing to be 
used for NFS analysis. There are four main challenges 
in conversion: representation, storage, performance and 
anonymization. Data representation is the challenge of 
deciding the logical structure of the converted data. Stor- 
age format is the challenge of picking a suitable physical 
structure for the converted data. Conversion performance 
is the challenge of making the conversion run quickly, ide- 
ally faster than the capture stage. Trace anonymization is 
the challenge of hiding sensitive information present in 
the data and is necessary for being able to release traces. 

One lesson we learned after conversion is that the con- 
verter’s version number should be included in the trace. 
As with most programs, there can be bugs. Having the 
version number in the trace makes it easy to determine 
which flaws need to be handled. For systems such as sub- 
version or git, we recommend the atomic check-in ID as 
a suitable version number. 

A second lesson was preservation of data. An NFS 
parser will discard data both for space reasons and for 
anonymization. Keeping underlying information, such as
per packet conversion in addition to per NFS-request conversion,
can enable cross checking between analyses. We
caught an early bug in our converter that failed to record 
packet fragments by comparing the packet rates and the 
NFS rates. 


4.1 Data representation 


One option for the representation is the format used in the 
Ellard [11] traces: one line per request or reply in a text 
file with field names to identify the different parameters in 
the RPC. This format is slow to parse, and works poorly 
for representing readdir, which has an arbitrary number 
of response fields. Therefore, we chose to use a more 
relational data structuring [7]. 

We have a primary data table with the common fields 
present in every request or reply, and an identifier for each 
RPC. We then have secondary tables that contain request-type
specific information, such as a single table for RPCs
that include attributes, and a single table for read and write
information. We then join the common table to the other 
tables when we want to perform an analysis that uses in- 
formation in both. Because of this structure, a single RPC 
request or reply will have a single entry in the common ta- 
ble. However, a request/reply pair will have zero (no entry 
in the read/write table unless the operation is a read/write) 
or more entries (multiple attribute entries for readdir+) in 
other tables. 

The relational structuring improves flexibility, and 
avoids reading unnecessary data for analyses that only 
need a subset of the data. For example, an analysis only 
looking at operation latency can simply scan the common 
table. 
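
The relational layout can be illustrated with a small sketch: one common table with a row per request or reply, secondary tables keyed by the same identifier, and a join only when an analysis needs both. The field names, sample rows, and the join helper are hypothetical; the actual DataSeries table definitions differ.

```python
from collections import defaultdict

# Common table: one row per RPC request or reply (illustrative field names).
common = [
    {"record_id": 1, "time": 0.001, "op": "read", "is_request": True},
    {"record_id": 2, "time": 0.003, "op": "getattr", "is_request": True},
    {"record_id": 3, "time": 0.004, "op": "readdirplus", "is_request": False},
]
# Secondary tables: zero, one, or several rows per common-table row.
read_write = [{"record_id": 1, "offset": 0, "bytes": 32768}]
attributes = [
    {"record_id": 3, "file_size": 4096},
    {"record_id": 3, "file_size": 12288},
]

def join_on_record_id(left, right):
    """Inner join of two tables on record_id."""
    by_id = defaultdict(list)
    for row in right:
        by_id[row["record_id"]].append(row)
    return [{**l, **r} for l in left for r in by_id.get(l["record_id"], [])]

# A latency-style analysis scans only the common table.
times = [row["time"] for row in common]
# An I/O size analysis joins the common table with the read/write table.
reads = join_on_record_id(common, read_write)
# A readdir+ reply contributes several attribute rows.
attrs = join_on_record_id(common, attributes)
```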


4.2 Storage format 


Having decided to use a relational structuring for our data, 
we next needed to decide how to physically store the 
data. Three options were available to us: text, SQL, and 
DataSeries, our custom binary format [2] for storing trace 
data. Text is a traditional way of storing trace data, how- 
ever, we were concerned that a text representation would 
be too large and too slow. Having later converted the
Ellard traces to our format, we found that the analysis distributed
with the traces used 25x less CPU time when the
traces and analysis used DataSeries, and ran 100x faster
on a 4 core machine. This disparity confirmed our intuition
that text is a poor format for trace data.

SQL databases support a relational structure. How- 
ever, the lack of extensive compression means that our 
datasets would consume a huge amount of space. We also 
expected that many complex queries would not benefit 
from SQL and would require extracting the entire tables 
through the slow SQL connection. 

Therefore, we selected DataSeries as an efficient and 
compact format for storing traces. It uses a relational data 
model, so there are rows of data, with each row comprised 
of the same typed columns. A column can be nullable, 
in which case there is a hidden boolean field for storing 
whether the value is null. Groups of rows are compressed 
as a unit. Prior to compression, various transforms are applied
to reduce the size of the data. First, duplicate strings
are collapsed down to a single string. Second, values are 
delta compressed relative to either the same value in the 
previous row or another value in the same row. For exam- 
ple, the packet time values are delta compressed, making 
them more compressible by a general purpose compres- 
sion algorithm. 
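
The effect of delta compression on time values can be sketched as below. This is an illustration of the idea only, not the DataSeries implementation: the timestamps are synthetic, zlib stands in for the general purpose compressor, and the 64-bit little-endian packing is an assumption.

```python
import random
import struct
import zlib

def delta_encode(values):
    """Replace each value with its difference from the previous row's value."""
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out

def pack(values):
    return struct.pack(f"<{len(values)}q", *values)  # 64-bit little-endian integers

# Synthetic packet times in microseconds with small gaps (made-up data).
rng = random.Random(0)
times, t = [], 1_200_000_000_000_000
for _ in range(10_000):
    t += rng.randint(1, 1000)
    times.append(t)

raw_size = len(zlib.compress(pack(times)))
delta_size = len(zlib.compress(pack(delta_encode(times))))
```

Because consecutive packet times are close together, the deltas are small integers whose upper bytes are all zero, which a general purpose compressor handles better than the slowly changing absolute values.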

DataSeries is designed for efficient access. Values are 
packed so that once a group of rows is read in, an analysis
can iterate over them simply by increasing a single
counter, as with a C++ vector. Individual values are ac- 
cessed by an offset from that counter and a C++ cast. Byte 
swapping is automatically performed if necessary. The 
offset is not fixed, so the same analysis can read different 
versions of the data, provided the meaning of the fields 
has not changed. Efficient access to subsets of the data is 
supported by an automatically generated index. 


DataSeries is designed for generality. It supports ver- 
sioning on the table types so that an analysis can properly 
interpret data that may have changed in meaning. It has 
special support for time fields so that analysis can convert 
to and from different raw formats. 


DataSeries is designed for integrity. It has internal 
checksums on both the compressed and the uncompressed 
data to validate that the data has been processed appropri- 
ately. Additional details on the format, additional trans- 
forms, and comparisons to a wide variety of alternatives 
can be found in the technical report [10]. 


4.3 Conversion performance 


To perform the conversion in parallel, we divide the col- 
lected files into groups and process each group separately. 
We make two passes through the data. First, we parse the 
data and count the number of requests or replies. Second, 
we use those counts to determine the first record-id for 
each group, and convert the files. Since NFS parsing re- 
quires the request to parse the reply, we currently do not 
parse any request-reply pairs that cross a group bound- 
ary. Similarly, we do not do full TCP reconstruction, so 
for NFS over TCP, we parse multiple requests or replies 
if the first one starts at the beginning of the packet. These 
limitations are similar to earlier work, so we found them 
acceptable. We run the conversion locally on the 8-way 
tracing machine rather than a cluster because conversion 
runs faster than the 1Gbit LAN connection we had at the 
customer site (the tracing card does not act as a normal 
NIC). Conversion of a full data set (30TiB) takes about 3 
days. 
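
The two-pass scheme amounts to a running sum over the per-group counts, sketched below. The counts and the function name are made up for illustration; the sketch covers only the record-id assignment, not the parsing itself.

```python
def first_record_ids(group_counts, start=0):
    """Given the per-group counts from pass 1, return each group's first record-id."""
    firsts, next_id = [], start
    for count in group_counts:
        firsts.append(next_id)
        next_id += count
    return firsts

# Pass 1 (run in parallel): number of requests or replies found in each group.
counts = [1200, 950, 1430]          # made-up counts
# Pass 2: each group can now be converted independently with globally unique ids.
firsts = first_record_ids(counts)
```

Once the starting identifiers are known, no group needs to wait for an earlier group to finish converting.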


We do offline conversion from trace files, rather than 
online conversion, primarily for simplicity. However, a 
side benefit was that our converter could be paranoid and 
conservative, rather than have it try to recover from con- 
version problems, since we could fix the converter when 
it was mis-parsing or was too conservative. The next time 
we trace, we plan to do more on-the-fly conversion by 
converting early groups and deleting those trace files dur- 
ing capture so that we can capture longer traces. 
4.4 Trace anonymization


In order to release the traces, we have to obscure private 
data such as filenames. There are three primary ways to 
map values in order to anonymize them: 


1. unique integers. This option results in the most 
compact identifiers (< 8 bytes), but is difficult to cal- 
culate in parallel and requires a large translation table 
to maintain persistent mappings and to convert back 
to the original data. 


2. hash/HMAC. This option results in larger identifiers 
(16-20 bytes), but enables parallel conversion. A 
keyed HMAC [5] instead of a hash protects against 
dictionary attacks. Reversing this mapping requires 
preserving a large translation table. 


3. encrypted values. This option results in the longest 
identifiers since the encrypted value will be at least 
as large as the original value. It is parallelizable and
easily reversible provided the small keys are maintained.


We chose the last approach because it preserved the 
maximum flexibility, and allowed us to easily have dis- 
cussions with the customer about unexpected issues such 
as writes to what should have been a read-only filesys- 
tem. Our encryption includes a self-check, so we can con- 
vert back to real filenames by decrypting all hexadecimal 
strings and keeping the ones that validate. We have also 
used the reversibility to verify for a colleague that they 
properly identified the ‘.’ and ‘..’ filenames.
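
For contrast with the encryption approach we chose, the hash/HMAC option from the list above can be sketched with a keyed digest. This is an illustration only: HMAC-SHA1 (20-byte output), the key, and the sample filename are assumptions, and this is not the mapping applied to the published traces.

```python
import hashlib
import hmac

def anonymize_hmac(value: str, key: bytes) -> str:
    """Map a filename to a fixed-size keyed digest, returned as hexadecimal."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha1).hexdigest()

key = b"example-secret-key"   # hypothetical key; a real key must stay private
a = anonymize_hmac("shot_042.rib", key)
b = anonymize_hmac("shot_042.rib", key)
```

The same input always maps to the same identifier, so converters can run in parallel without sharing state, but recovering the original name requires a saved translation table.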

We chose to encrypt entire filenames since the suffixes 
are specific to the animation process and are unlikely to be 
useful to people. This choice also simplified the discus- 
sions about publishing the traces. Since we can decrypt, 
we could in the future change this decision. 

The remaining values were semi-random (IP addresses 
in the 10.* network, filehandles selected by the NFS 
servers), so we pass those values through unchanged.
We decided that the filehandle content, which includes 
for our NFS servers the filesystem containing the file, 
could be useful for analysis. Filehandles could also be 
anonymized. 

All jobs in the customer’s cluster were being run as a
common user, so we did not capture user identifiers. Since 
they are transitioning away from that model, future traces 
would include unchanged user identifiers and group iden- 
tifiers. If there were public values in the traces, then we 
would have had to apply more sophisticated anonymiza- 
tion [27]. 
5 Analysis techniques


Analyzing the very large amount of data that we col- 
lected required us to adopt and develop new analysis 
techniques. The most important property that we aimed 
for was bounded memory, which meant that we needed 
to have streaming analysis. The second property that 
we wanted was efficiency, because without compute-time 
efficiency, we would not be able to analyze complete 
datasets. One of our lessons is that these techniques al- 
lowed us to handle the much larger datasets that we have 
collected. 


5.1 Approximate quantiles 


Quantiles are better than simple statistics or histograms 
because they do not accidentally combine separate mea- 
surements regardless of distribution. Unfortunately, for 
our data, calculating exact quantiles is impractical. For a 
single dataset, we collect multiple statistics with a total of 
about 200 billion values. Storing all these values would 
require ~1.5TiB of memory, which makes it impractical 
for us to calculate exact quantiles. 

However, there is an algorithm from the database field
for calculating approximate quantiles in bounded memory [22].
A $q$-quantile of a set of $n$ data elements is
the element at position $\lceil q \cdot n \rceil$ in the sorted list of elements
indexed from 1 to $n$. For approximate quantiles,
the user specifies two numbers: $\epsilon$, the maximum error, and
$N$, the maximum number of elements. Then when the
program calculates quantile $q$, it actually gets a quantile
in the range $[q - \epsilon, q + \epsilon]$.

Provided that the total number of elements is less than 
N, the bound is guaranteed. We have found that usu- 
ally the error is about 10x better than specified. For our 
epsilon of 0.005 (sufficient to guarantee all percentiles 
are distinct), instead of needing ~1.5TiB, we only need
~1.2MiB, an improvement of > $10^6$. This dramatic improvement
means we can run the analysis on one machine,
and hence process multiple sets in parallel. The perfor- 
mance cost of the algorithm is about the same as sort- 
ing since the algorithm does similar sorting of subsets and 
merging of subsets. Details on how the algorithm works 
can be found in [22] or our software distribution. 
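
The definitions above can be made concrete with a short sketch. It shows the exact quantile and the acceptance test for an approximate answer; it is not the bounded-memory algorithm of [22], since it sorts all the values. The helper names, the sample data, and the 8 bytes per value used in the memory estimate are assumptions.

```python
import math

def exact_quantile(values, q):
    """Element at position ceil(q * n) in the sorted list, indexed from 1."""
    ordered = sorted(values)
    position = max(1, math.ceil(q * len(ordered)))
    return ordered[position - 1]

def within_error(values, q, candidate, eps):
    """True if candidate lies between the (q - eps) and (q + eps) quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    lo = max(1, math.ceil((q - eps) * n))
    hi = min(n, math.ceil((q + eps) * n))
    return ordered[lo - 1] <= candidate <= ordered[hi - 1]

data = list(range(1, 1001))          # for this list the q-quantile is ceil(1000 * q)
median = exact_quantile(data, 0.5)

# Memory to hold every value for exact quantiles, assuming 8 bytes per value.
exact_tib = 200e9 * 8 / 2**40
```

With an error bound of 0.005, any element between roughly the 49.5th and 50.5th percentiles is an acceptable answer for the median.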


5.2 Data cube


Calculating aggregate or roll-up statistics is an important 
part of analyzing a workload. For example, consider the 
information in the common NFS table: (time, operation, 
client-id, and server-id). We may want to calculate the to- 
tal number of operations performed by client 5, in which 
case we want to count the number of rows that match
(*, *, 5, *).

The cube [16] is a generalization of the group-by op- 
erations described above. Given a collection of rows, 
it calculates the set of unique values for each column 
U(c), adds the special value ‘ANY’ to the set, and then 
generates one row for each member of the cross-product 
U(1) × U(2) × ... × U(n).

We implemented an efficient templated version of the 
cube operator for use in data analysis. We added three
features to deal with memory usage. First, our cube can
include only rows with actual values in them. This
eliminates the large number of rows from the cross-product
that match no rows in the base data. Second, we can further
restrict which rows are generated. For example, we have a
large number of client ids, so we can avoid cubing over
entries with both the client and operation specified to
reduce the number of statistics calculated. Third, we added
the ability to prune values out of the cube. For example, we
can output cube values for earlier time values and remove
them from the data structure once we reach later time
values, since we know the data is sorted by time.

The cube allows us to easily calculate a wide variety of 
summary statistics. We had previously manually imple- 
mented some of the summary statistics by doing explicit 
roll-ups for some of the aggregates described in the ex- 
ample. We discovered that the general implementation 
was actually more efficient than our manual one because 
it used a single hash table for all of the data rather than 
nested data structures, and because we tuned the hash 
function over the tuple of values to be calculated effi- 
ciently. 
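
As an illustration of the first feature and of the single hash table keyed by the whole tuple, the sketch below counts rows for every combination of "value or ANY" that matches at least one input row. It is a minimal Python sketch, not the templated C++ implementation described here; the name sparse_cube and the use of None for 'ANY' are assumptions, and a real column value of None would collide with ANY.

```python
from collections import Counter
from itertools import product

ANY = None  # stands in for the special 'ANY' value

def sparse_cube(rows):
    """Count rows for every (value-or-ANY) tuple that matches at least one row."""
    counts = Counter()  # one flat hash table keyed by the full tuple
    for row in rows:
        # Each column is either kept or replaced by ANY: 2**len(row) keys per row.
        for key in product(*[(value, ANY) for value in row]):
            counts[key] += 1
    return counts

# Hypothetical rows in the (time, operation, client-id, server-id) layout.
rows = [(0, "read", 5, 1), (1, "write", 5, 1), (2, "read", 7, 1)]
cube = sparse_cube(rows)
ops_by_client_5 = cube[(ANY, ANY, 5, ANY)]
```

Because keys are generated only from rows that exist, combinations such as (ANY, "write", 7, ANY) never appear in the table.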


5.3 HashTable


Our hash-table implementation [9] is a straightforward 
chained-hashing implementation. In our experiments it 
is strictly better in both performance and memory than 
the GNU C++ hash table. It uses somewhat more mem- 
ory than the Google sparse hash [15], but performs al- 
most as well as the dense hash; it is strictly faster than the 
g++ STL hash. We added three unusual features. First, 
it can calculate its memory usage, allowing us to deter- 
mine what needs to be optimized. Second, it can partially 
reset iterators, which allows for safe mutating operations 
on the hash table during iteration, such as deleting a sub- 
set of the values. Third, it can return the underlying hash 
chains, allowing for sorting the hash table without copy- 
ing the values out. This operation destroys the hash table, 
but the sort is usually done immediately before deleting 
the table, and reduces memory usage by a factor of 2.


5.4 Rotating hash-map


Limiting memory usage for hash tables where the entries 
have unknown lifespan presents some challenges. Con- 
sider the sequentiality metric: so long as accesses are ac- 
tive to the file, we want to continue to update the run infor- 
mation. Once the file becomes inactive for long enough, 
we want to calculate summary statistics and remove the 
general statistics from memory. We could keep the values 
in an LRU data-structure. However, if our analysis only
needs a file id and last offset, then the forward and backward
pointers for LRU would double the memory usage.
A clock-style algorithm would require regular full scans 
of the entire data structure. 

We instead solve this problem by keeping two hash- 
maps, the recent and old hash-maps. Any time a value 
is accessed, it is moved to the recent hash-map if it is 
not already there. At intervals, the program will call 
the rotate(fn) operation which will apply fn to all of the 
(key, value) pairs in the old hash map, delete that map, as- 
sign the recent map to the old map and create a new recent 
map. 

Therefore, if the analysis wants to guarantee any gap of 
up to 60 seconds will be considered part of the same run, 
it just needs to call rotate() every 60 seconds. Any value 
accessed in the last 60 seconds will remain present in the 
hash-map. We could reduce the memory overhead some- 
what by keeping more than two hash-maps at the cost of 
additional lookups, but we have so far found that the ro- 
tating hash-map provides a good tradeoff between mini- 
mizing memory usage and maximizing performance. We 
believe that the LRU approach would be more effective 
if the size of the data stored in the hash map were larger, 
and the hash-map could compact itself so that scattered 
data entries do not consume excess space. 
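
The rotation scheme described above can be sketched in a few lines. This is an illustrative Python version built on dictionaries, not the C++ implementation; the class and method names other than rotate are assumptions.

```python
class RotatingHashMap:
    """Two maps: entries touched since the last rotation live in `recent`."""

    def __init__(self):
        self.recent = {}
        self.old = {}

    def put(self, key, value):
        self.old.pop(key, None)   # keep a key in at most one map
        self.recent[key] = value

    def get(self, key, default=None):
        if key in self.recent:
            return self.recent[key]
        if key in self.old:
            value = self.old.pop(key)   # an access moves the entry to `recent`
            self.recent[key] = value
            return value
        return default

    def rotate(self, fn):
        """Apply fn to every entry not touched since the previous rotation."""
        for key, value in self.old.items():
            fn(key, value)
        self.old = self.recent    # the recent map becomes the old map
        self.recent = {}          # and a fresh recent map is created

expired = []
m = RotatingHashMap()
m.put("file-a", 4096)
m.put("file-b", 8192)
m.rotate(lambda k, v: expired.append(k))   # nothing is old yet
m.get("file-a")                            # touch file-a only
m.rotate(lambda k, v: expired.append(k))   # file-b has been idle for a full interval
```

An entry is handed to fn only after it has gone untouched for at least one full rotation interval, and for at most two.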


5.5 Graphing with mercury-plot 


Once we have summarized the data from DataSeries 
using the techniques described above, we need to graph 
and subset the data. We combined SQL, Perl, and gnuplot 
into a tool we call mercury-plot. SQL enables sub-setting 
and combining data. For example, if we have data on
60-second intervals, it is easy to calculate min/mean/max
for 3600-second intervals, or with the cube to select out
the subset of the data that we want to use. We use Perl
to handle operations that the database cannot handle.
For example, in the cube, we represent the ‘ANY’ value 
as null, but SQL requires a different syntax to select 
for null vs. a specific value. We hide this difference in 
the Perl functions. In practice, this allows us to write 
very simple commands such as plot quantile as
x, value as y from nfs_hostinfo_cube where
operation = 'read' and direction = 'send' to
generate a portion of the graph. This tool allows us to
deal with the millions of rows of output that can come 
from some of the analyses. To ease injection of data from
the C++ DataSeries analysis, one lesson we learned is
that each analysis should have a mode that generates SQL
insert statements in addition to human-readable output.
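
The null-versus-value difference that the Perl functions hide can be illustrated as follows. This is a hedged Python sketch using SQLite rather than the actual mercury-plot code; where_clause, the table schema, and the sample rows are hypothetical, and only the table and column names are taken from the command shown above.

```python
import sqlite3

ANY = None  # the cube's 'ANY' value, stored as SQL NULL

def where_clause(**conditions):
    """Return (sql, params); ANY becomes 'IS NULL', other values become '= ?'."""
    parts, params = [], []
    for column, value in conditions.items():  # column names must come from trusted code
        if value is ANY:
            parts.append(f"{column} IS NULL")  # '= NULL' never matches in SQL
        else:
            parts.append(f"{column} = ?")
            params.append(value)
    return " AND ".join(parts), params

# Hypothetical miniature of a cube table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nfs_hostinfo_cube (operation, direction, quantile, value)")
conn.executemany(
    "INSERT INTO nfs_hostinfo_cube VALUES (?, ?, ?, ?)",
    [("read", "send", 0.5, 10.0), ("read", None, 0.5, 25.0), (None, None, 0.5, 40.0)],
)
sql, params = where_clause(operation="read", direction=ANY)
rows = conn.execute(
    f"SELECT quantile, value FROM nfs_hostinfo_cube WHERE {sql}", params
).fetchall()
```

The caller writes the same kind of condition for a specific value and for 'ANY', and the helper picks the matching SQL syntax.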


6 Analysis 


Analyzing very large traces can take a long time. While 
our custom binary format enables efficient analysis, and 
our analysis techniques are efficient, it can still take 4-8 
hours to analyze a single set of the 2007 traces. In prac- 
tice, we analyze the traces in parallel on a small cluster 
of four-core 2.4GHz Opterons. Our analysis typically becomes
bottlenecked on the file servers that serve up to
200MiB/s each once an analysis is running on more than 
20 machines. 

We collected data at two times: August 2003 - Febru- 
ary 2004 (anim-2003), and January 2007 - October 2007 
(anim-2007). We collected data using a variety of mir- 
ror ports within the company’s network. The network de- 
sign is straightforward: there is a redundant set of core 
routers, an optional mid-tier of switches to increase the 
effective port count of the core, and then a collection of 
edge switches that each cover one or two racks of render- 
ing machines. Most of our traces were taken by mirror- 
ing links between rendering machines and the rest of the 
network. For each collected dataset, we would start the 
collection process, and let it run either until we ran out of 
disk space, or we had collected all the data we wanted. 
Each of these runs comprises a set. We have 21 sets from 
2003, and 8 sets from 2007. 

We selected a subset of the data to present, two datasets 
from 2003 and two from 2007. The sets were selected 
both because they are representative of the more inten- 
sive traces from both years, and to show some variety in 
the data. We identified clients as hosts that sent requests, 
servers as hosts that sent replies, and caches as hosts that 
acted as both clients and servers. Further information on 
each dataset can be found on the trace download page [4]. 


• anim-2003/set-5: A trace of 79 clients accessing 50
NFS servers. NFS caches are seen as servers in this
trace.


• anim-2003/set-12: A trace of 1634 clients accessing
1 NFS server. NFS caches are seen as clients in this
trace.


• anim-2007/set-2: A trace of 273 clients accessing 40
NFS servers at the same site as anim-2003/set-5. The
other traces of clients at this site are similar to set-2.


• anim-2007/set-5: A trace of 135 clients accessing
50 NFS servers, and 8 caches acting as both clients
and servers, although because of the port mirroring
setup, we did not see some of the responses from the
caches. This trace is at a different site from set-2 and
shows higher burstiness.


6.1 Capture performance 


We start our analysis by looking at the performance of our 
capture tool. This validates our claims that we can capture 
packets at very high data rates. We examine the capture 
rate of the tool by calculating the megabits/s (Mbps) and 
kilo-packets/s (kpps) for 20 overlapping sub-intervals of a 
specified length. For example, if our interval length is 60
seconds, then we will calculate the bandwidth for the in- 
terval 0s-60s, 3s-63s, 6s-66s, ... end-of-trace. We chose to 
calculate the bandwidth for overlapping intervals so that 
we would not incorrectly measure the peaks and valleys of 
cyclic patterns aligned to the interval length. We use the 
approximate quantile so we can summarize results with 
billions of underlying data points. For example, we have 
11.6 billion measurements for anim-2007/set-0 at a 1ms 
interval length. This corresponds to the 6.7 days of that 
trace. 
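
The overlapping-interval scheme and the measurement count quoted above can be sketched as follows. This is an illustrative Python version, not the capture-analysis code; the function names and the packet representation are assumptions, and the inner loop is deliberately simple rather than efficient.

```python
def window_starts(trace_length, interval, overlap=20):
    """Start times of windows of `interval` seconds, stepped by interval / overlap."""
    step = interval / overlap
    count = int((trace_length - interval) / step) + 1
    return [i * step for i in range(count)]

def mbps_per_window(packets, trace_length, interval, overlap=20):
    """packets: list of (timestamp_seconds, size_bytes); returns Mbps per window."""
    rates = []
    for start in window_starts(trace_length, interval, overlap):
        total = sum(size for t, size in packets if start <= t < start + interval)
        rates.append(total * 8 / interval / 1e6)
    return rates

starts = window_starts(trace_length=120, interval=60)   # 0s, 3s, 6s, ..., 60s
rates = mbps_per_window([(1.0, 1_000_000)], trace_length=120, interval=60)

# Rough count for a 6.7-day trace at a 1ms interval with 20 overlapping windows.
measurements = 6.7 * 86400 / 0.001 * 20
```

The last line gives roughly 11.6 billion, which matches the figure quoted for anim-2007/set-0.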

Figure 2(a) shows anim-2007/set-5 at different interval 
lengths. This graph shows the effectiveness of our tracing 
technology, as we have sustained intervals above 3Gb/s 
(358MiB/s), and 1ms intervals above 4Gb/s (476MiB/s).
Indeed these traces show the requirement for high speed 
tracing, as 5-20% of the trace intervals have sustained in- 
tervals above 1Gbit, which is above the rate at which Le- 
ung [20] noted their tracing tool started to drop packets. 
The other sets from anim-2007 are somewhat less bursty, 
and the anim-2003 data shows much lower peaks because 
of our more limited tracing tools, and a wider variety of 
shapes, because we traced at more points in the network. 

Figure 2(a) also emphasizes how bursty the traffic was 
during this trace. While 50% of the intervals were above 
500Mbit/s for 60s intervals, only 30% of the intervals
were above 500Mbit/s for 1ms intervals. This burstiness
is expected given that general Ethernet and filesystem
traffic have been shown to be self-similar [17, 19],
which implies the network traffic is also bursty. It does 
make it clear that we need to look at short time intervals 
in order to get an accurate view of the data. 

Figure 2(b) shows the tail of the distributions for the
capture rates for two of the trace sets. The relative
similarity between the Mbps and kpps graphs is simply
because packet size distributions are relatively constant. The
traces show the remarkably high burstiness of the 2007
traces. While 90% of the 1ms intervals are below 2Gb/s,
0.1% are above 6Gb/s. We expect we would have seen
slightly higher rates, but because of our configuration
error for the 2007 capture tool, we could not capture above
about 8Gb/s.


Table 1: symlink, rmdir, mkdir, and rename were pruned as there were fewer than 1 million operations; fsinfo, link,
null, create, remove, and setattr were pruned as there were fewer than 10 million operations. The Mops column could
be calculated from nfsstat, but the bytes/op column could not.

Figure 3: Operation rates, as quantiles, for anim-2003, anim-2007.

Figure 4: Bandwidth for reads and operation rate for
getattrs in the four traces.


6.2 Basic NFS analysis 


Examining the overall set of operations used by a work- 
load provides insight into what operations need to be op- 
timized to support the workload. Examining the distribu- 
tion of rates for the workload tells us if the workload is 
bursty, and hence we need to handle a higher rate than 
would be implied by mean arrival rates, and if there are 
periods of idleness that could be exploited. 

Table 2: Operation rate ratios


Table 1 provides an overview of all the operations that
occurred in the four traces we are examining in more
detail. It shows a number of substantial changes in the
workload presented to the NFS subsystem. First, the read and
write sizes have almost doubled from the anim-2003 to 
anim-2007 datasets. This trend is expected, because the 
company moved from NFSv2 to NFSv3 between the two 
tracing periods, and set the v3 read/write size to 16KiB. 
The company told us they set it to that size based on per- 
formance measurements of sequential I/O. The NFS ver- 
sion switch also accounts for the increase in access calls 
(new in v3), and readdirplus (also new in v3). 

We also see that this workload is incredibly read-heavy. 
This is expected; the animation workload reads a very 
large number of textures, models, etc. to produce a rel- 
atively small output frame. However, we believe that our 
traces under-estimate the number of write operations. We 
discuss the write operation underestimation below. The 
abnormally low read size for set-12 occurred because that 
server was handling a large number of stale filehandle re- 
quests. The replies were therefore small and pulled down 
the bytes/operation. We see a lot more getattr operations 
in set-5 than set-12 because set-12 is a server behind sev- 
eral NFS-caches, whereas set-5 is the workload before the 
NFS-caches. 

Table 2 and Figures 3(a,b) show how long averaging in- 
tervals can distort the load placed on the storage system. If 
we were to develop a storage system for the hourly loads 
reported in most papers, we would fail to support the sub- 
stantially higher near peak (99%) loads seen in the data. 
It also hides periods of idleness that could be used for in- 
cremental scrubbing and data reorganization. We do not 
include the traditional graph of ops/s vs. time because our 
workload does not show a strong daily cycle. Animators 
submit large batches of jobs in the evening that keep the 
cluster busy until morning, and keep the cluster busy dur- 
ing the day submitting additional jobs. Since the jobs are 
very similar, we see no traditional diurnal pattern in the 
NFS load, although we do see the load go to zero by the 
end of the weekend. 

Figure 4 shows the read operation MiB/s and the getattr 
operations/s. It shows that relative to the amount of data 
being transferred, the number of getattrs has been
reduced, likely a result of the transition from NFSv2 to
NFSv3. The graph shows the payload data transferred,
so it includes the offset and filehandle of the read request, 
and the size and data in the reply, but does not include 
IP headers or NFS RPC headers. It shows that the NFS 
system is driven heavily, but not excessively. The write 
operations/s graph (not shown for space reasons) implies 
that the write bandwidth has gotten more bursty, but has 
stayed roughly constant. 


This result led us to further analyze the data. We 
were surprised that write bandwidth did not increase, even 
though it is not implausible, as the frame output size has 
not increased. We analyzed the traces to look for missing 
operations in the sequence of transaction ids, automati- 
cally inferring if the client is using a big-endian or little- 
endian counter. The initial results looked quite good: 
anim-2007/set-2 showed 99.7% of the operations were in 
sequence, anim-2007/set-5 showed 98.4%, and counting 
the skips of 128 transactions or less, we found only 0.21% 
and 0.50% respectively (the remaining entries were dupli- 
cates or ones that we could not positively tell if they were 
in sequence or a skip). However, when we looked one 
level deeper at the operation that preceded a skip in the 
sequence, we found that 95% of the skips followed a write 
operation for set-2, and 45% for set-5. The skips in set-2 
could increase the write workload by a factor of 1.5 if
all operations missing in skips after writes are themselves writes.
We expected a fair number of skips for set-5 since we ex- 
perienced packet loss under load, but we did not expect it 
for set-2. 
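
The transaction-id check described above can be sketched roughly as follows. This is an illustrative Python version, not the authors' analysis: it assumes each client increments a 32-bit id by one per request, infers the byte order by trying both interpretations, and treats a jump that leaves 1 to 128 ids missing as a skip. The function names and the exact skip threshold handling are assumptions.

```python
import struct

def byteswap32(x):
    """Reinterpret a 32-bit value read in one byte order in the other."""
    return struct.unpack("<I", struct.pack(">I", x))[0]

def classify(xids, max_skip=128):
    """Classify each consecutive pair of transaction ids from one client."""
    counts = {"in_sequence": 0, "skip": 0, "duplicate": 0, "unknown": 0}
    for prev, cur in zip(xids, xids[1:]):
        delta = (cur - prev) & 0xFFFFFFFF   # 32-bit wraparound
        if delta == 1:
            counts["in_sequence"] += 1
        elif delta == 0:
            counts["duplicate"] += 1
        elif delta - 1 <= max_skip:         # delta - 1 ids are missing
            counts["skip"] += 1
        else:
            counts["unknown"] += 1
    return counts

def infer_and_classify(xids):
    """Pick the byte order under which more ids are in sequence."""
    as_is = classify(xids)
    swapped = classify([byteswap32(x) for x in xids])
    if swapped["in_sequence"] > as_is["in_sequence"]:
        return "little-endian", swapped
    return "big-endian", as_is

order, counts = infer_and_classify([1, 2, 3, 5, 5, 6])
```

Recording which operation preceded each skip, as the text does for writes, would only require carrying the operation type alongside each id.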


Further examination indicated that the problem came 
about because we followed the same parsing technique for 
TCP packets as was used in nfsdump2 [11]. We started at 
the beginning of the packet and parsed all of the RPCs 
that we found that matched all required bits to be RPCs. 
Unfortunately, over TCP, two back to back writes will 
not align the second write RPC with the packet header, 
and we will miss subsequent operations until they re-align 
with the packet start. While the fraction of missing operations
is small, they are biased toward write requests
and read replies. Since we had saved IP-level trace in- 
formation as well as NFS-level, we could write an analy- 
sis that conservatively calculated the bytes of IP packets 
that were not associated with an NFS request or reply. 
Counting a direction of a connection if it transfers over 
10^6 bytes, we found for anim-2007/set-2 that we can account
for >90% of the bytes for 87% of the connections,
and for anim-2007/set-5 that we can account for >90% 
of the bytes for >70% of the connections. The greater 
preponderance of missing bytes relative to missing operations
reinforces our analysis above that the losses are due
to non-aligned RPCs since we are missing very few
operations, but many more bytes, and reads and writes have
a high byte to operation ratio.


Figure 5: File size distribution for all accessed files.
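
The alignment problem can be reproduced with a small sketch. RPC over TCP uses record marking: each fragment is preceded by a 4-byte big-endian header whose top bit flags the last fragment and whose low 31 bits give the fragment length. The Python below is an illustration, not the capture tool's parser; the 1448-byte packet payload and the 3000-byte messages are assumptions chosen to make the effect visible.

```python
import struct

def frame(payload):
    """Wrap one RPC message as a single last-fragment record."""
    return struct.pack(">I", 0x80000000 | len(payload)) + payload

def split_records(stream):
    """Split a reassembled TCP byte stream into RPC messages using record marks."""
    messages, fragments = [], []
    offset = record_start = 0
    while offset + 4 <= len(stream):
        (mark,) = struct.unpack_from(">I", stream, offset)
        length = mark & 0x7FFFFFFF
        if offset + 4 + length > len(stream):
            break                           # incomplete fragment: wait for more bytes
        fragments.append(stream[offset + 4:offset + 4 + length])
        offset += 4 + length
        if mark & 0x80000000:               # last fragment of this record
            messages.append(b"".join(fragments))
            fragments = []
            record_start = offset
    return messages, stream[record_start:]  # leftover starts at the unfinished record

MSS = 1448                                  # assumed TCP payload bytes per packet
write_a, write_b = b"A" * 3000, b"B" * 3000
stream = frame(write_a) + frame(write_b)

packet_starts = set(range(0, len(stream), MSS))
record_starts = {0, 4 + len(write_a)}
aligned = packet_starts & record_starts     # records a per-packet parser could see

messages, leftover = split_records(stream)
```

Here only the first write starts at a packet boundary, so a parser that looks only at packet starts sees one message where stream-based parsing recovers both.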

While this supports our lesson that retaining lower level 
information is valuable, this analysis also leads us to
another one of our lessons: extensive validation of the
conversion tool is important, both through validation
statistics and through the use of a known workload
that exercises the capture tools. An NFS replay tool [32]
could be used to generate a workload, the replayed work- 
load could be captured, and the capture could be com- 
pared to the original replayed workload. This comparison 
has been done to validate a block based replay tool [3], 
but has not been done to validate an NFS tracing tool, as 
the work has simply assumed tracing was correct. We be- 
lieve a similar flaw is present in earlier traces [11] because 
the same parsing technique was used, although we do not 
know how much those traces were affected. 


6.3 File sizes 


File sizes affect the potential internal fragmentation for a 
filesystem. They affect the maximum size of I/Os that can 
be executed, and they affect the potential sequentiality in 
a workload. 

Figure 5 shows the size of files accessed in our traces. 
It shows that most files are small enough to be read in 
a single I/O: 40-80% of the files are smaller than 8KiB 
(NFSv2 read size) for the 2003 traces, and 70% of the files
are smaller than 16KiB for the 2007 traces. While there 
are larger files in the traces, 99% of the files are smaller 
than 10MiB. The small file sizes present in this workload, 
and the preponderance of reads suggest that a flash file 
system [18] or MEMS file system [29] could support a 
substantial portion of the workload. 


Figure 6: Number of reads or sequential bytes in a single
group (more than 30s gap between I/Os).


6.4 Sequentiality 


Sequentiality is one of the most important properties for 
storage systems because disks are much more efficient 
when handling sequential data accesses. Prior work has 
presented various methods for calculating sequentiality. 
Both Ellard [12] and Leung [20] split accesses into groups 
and calculate the sequentiality within the group. Ellard 
emulates opens and closes by looking for 30s groups in 
the access pattern. Ellard tolerates small gaps in the re- 
quest stream as sequential, e.g. an I/O of 7KiB at offset 
0 followed by an I/O of 8KiB at offset 8KiB would be 
considered sequential. 

Ellard also reorders I/Os to deal with client-side 
reordering. In particular, Ellard looks forward a constant 
amount from the request time to find I/Os that could make 
the access pattern more sequential. This constant was de- 
termined empirically. Leung treats the first I/O after an 
open as sequential, essentially assuming that the server 
will prefetch the first few bytes in the file or that the file 
is contiguous with the directory entry as with immediate
files [24]. For NFS, the server may not see a lookup before 
a read, depending on whether the client has used readdir+ 
to get the filehandle instead of a lookup. 

We determine sequentiality by reordering within tem- 
porally overlapping requests. Given two I/Os, A and B, 
if the request-reply intervals overlap, then we are willing 
to reorder the requests to improve estimated sequentiality. 
We believe this is a better model because the NFS server 
could reorder those I/Os. In practice, Figure 7 shows that 
for our traces this reordering makes little difference. Al- 
lowing reordering an additional 10ms beyond the reply of 
I/O A slightly increases the sequentiality, but generally 
not much more than just for overlapping requests. 

We also decide on whether the first I/O is sequential or 
random based on additional I/Os. If the second I/O (after 
any reordering) is sequential to the first one, then the first
I/O is sequential, otherwise it is random. If there is only 
one I/O to a particular file, then we consider the I/O to be 
random since the NFS server would have to reposition to 
that file to start the read. 
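
The two rules above (reordering only within overlapping request-reply intervals, and classifying the first I/O by looking at the second) can be sketched as follows. This is a simplified greedy Python illustration rather than the authors' implementation: the IO tuple, the function names, and the greedy choice are assumptions, and the sketch never reorders the very first I/O of a group.

```python
from collections import namedtuple

IO = namedtuple("IO", "request reply offset length")

def reorder_overlapping(ios):
    """Greedy: take the earliest pending I/O unless an I/O that overlaps it in
    time continues the current run, in which case take that one first."""
    pending = sorted(ios, key=lambda io: io.request)
    ordered, expected = [], None
    while pending:
        head = pending[0]
        choice = head
        if expected is not None and head.offset != expected:
            for io in pending[1:]:
                if io.request > head.reply:
                    break                   # no longer overlaps head's interval
                if io.offset == expected:
                    choice = io
                    break
        pending.remove(choice)
        ordered.append(choice)
        expected = choice.offset + choice.length
    return ordered

def classify(ordered):
    """Label each I/O; the first takes the label of the second, a lone I/O is random."""
    if not ordered:
        return []
    if len(ordered) == 1:
        return ["random"]
    labels = [None]
    for prev, cur in zip(ordered, ordered[1:]):
        sequential = cur.offset == prev.offset + prev.length
        labels.append("sequential" if sequential else "random")
    labels[0] = labels[1]
    return labels

ios = [IO(0.0, 0.5, 0, 8192), IO(0.1, 0.6, 16384, 8192), IO(0.2, 0.7, 8192, 8192)]
with_reordering = classify(reorder_overlapping(ios))
without_reordering = classify(sorted(ios, key=lambda io: io.request))
```

In this made-up group all three requests overlap in time, so reordering turns three random accesses into one sequential run.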

Given our small file sizes, it turns out that most ac- 
cesses count as random because they read the entire file in 
a single I/O. We can see this in Figure 6(a), which shows 
the number of reads in a group. Most groups are single 
I/O groups (70-90% in the 2007 traces). We see about 
twice as many I/Os in the 2003 traces, because the I/Os in 
the 2003 traces are only 8KiB, rather than 16KiB. 

Sequential runs within a random group are more inter- 
esting. Figure 6(b) shows the number of bytes accessed in 
sequential runs within a random group. We can see that if 
we start accessing a file at random, most (50-80%) of the 
time we will do single or double I/O accesses (8-32KiB). 
However we also get some extended runs within a random 
group, although 99% of the runs are less than 1MiB. 


7 Conclusions 


We have described three improved techniques for packet 
capture on networks. The easily adopted technique should 
allow anyone capturing NFS, CIFS, or iSCSI traffic from 
moderate performance storage systems (<1Gbit) to cap- 
ture traffic with no losses. The most advanced tech- 
nique allows lossless capture for 5-10Gbit storage sys- 
tems, which is at the high end of most file storage sys- 
tems. The primary lesson from this part of the work is 
that lossless 1Gbit packet capture is straightforward and 
up to 10Gbit is possible with an investment in develop- 
ment time or specialized hardware. 

We have provided guidelines for conversion for future
practitioners: parallelizing the conversion, retaining
lower-level information, using reversible anonymization,
approaches for testing the conversion tools, and tagging
the trace data with version information.


Figure 7: Each line group shows how estimated sequentiality
was affected by allowing, in order: reordering
of I/Os within 10ms of the reply, reordering within the
request-reply window, or no reordering. The small horizontal
change shows that reordering this workload has a
negligible effect on sequentiality.


We have described our binary storage format, which 
uses chunked compression with multiple possible com- 
pression techniques, typed relational-style data structur- 
ing, delta encoding, and type-safe, high-speed accessors. 
It improves over prior storage formats by up to 100x. 


We have described our techniques for improved mem- 
ory and performance efficiency to enable analysis of very 
large data sets. We explained the cube and approximate 
quantile techniques that we adopted from the database lit- 
erature, and our hashtable, rotating hash-map, and plot- 
ting techniques that we use for analyzing the data. 


We have analyzed our NFS workload examining some 
of the different properties found in a feature animation 
workload and demonstrating that our techniques are effec- 
tive. We found that our workload had much more activity 
than previously described workloads, and that the file size 
and sequentiality is different than those workloads. 


The tools described in this paper are available
as open source from http://tesla.hpl.hp.com/opensource/,
and the traces are available from
http://apotheca.hpl.hp.com/pub/datasets/animation-bear/.


Abstract


The scale of today’s storage systems has made it in- 
creasingly difficult to find and manage files. To address 
this, we have developed Spyglass, a file metadata search 
system that is specially designed for large-scale storage 
systems. Using an optimized design, guided by an anal- 
ysis of real-world metadata traces and a user study, Spy- 
glass allows fast, complex searches over file metadata to 
help users and administrators better understand and man- 
age their files. 

Spyglass achieves fast, scalable performance through 
the use of several novel metadata search techniques that 
exploit metadata search properties. Flexible index con- 
trol is provided by an index partitioning mechanism that 
leverages namespace locality. Signature files are used 
to significantly reduce a query’s search space, improving 
performance and scalability. Snapshot-based metadata 
collection allows incremental crawling of only modified 
files. A novel index versioning mechanism provides both 
fast index updates and “back-in-time” search of meta- 
data. An evaluation of our Spyglass prototype using our 
real-world, large-scale metadata traces shows search per- 
formance that is 1-4 orders of magnitude faster than ex- 
isting solutions. The Spyglass index can quickly be up- 
dated and typically requires less than 0.1% of disk space. 
Additionally, metadata collection is up to 10x faster than 
existing approaches. 


1 Introduction 


The rapidly growing amounts of data in today’s storage
systems make finding and managing files extremely
difficult. Storage users and administrators need to effi- 
ciently answer questions about the properties of the files 
being stored in order to properly manage this increas- 
ingly large sea of data. Metadata search, which involves 
indexing file metadata such as inode fields and extended 
attributes, can help answer many of these questions [26]. 
Metadata search allows point, range, top-k, and aggregation
search over file properties, facilitating complex,
ad hoc queries about the files being stored. For exam- 
ple, it can help an administrator answer “which files can 
be moved to second tier storage?” or “which applica- 
tion’s and user’s files are consuming the most space?” 
Metadata search can also help a user find his or her ten 
most recently accessed presentations or largest virtual 
machine images. Efficiently answering these questions 
can greatly improve how users and administrators manage
files in large-scale storage systems. 


Unfortunately, fast and efficient metadata search in 
large-scale storage systems is difficult to achieve. Both 
customer discussions [37] and personal experience have 
shown that existing enterprise search tools that provide 
metadata search [4, 14, 17, 21, 30] are often too expensive,
slow, and cumbersome to be effective in large-scale
systems. Effective metadata search must meet several 
requirements. First, it must be able to quickly gather 
metadata from the storage system. We have observed 
commercial systems that took 22 hours to crawl 500 GB 
and 10 days to crawl 10 TB. Second, search and update 
must be fast and scalable. Existing systems typically 
index metadata in a general-purpose DBMS. However, 
DBMSs are not a perfect fit for metadata search, which 
can limit their performance and scalability in large-scale 
systems. Third, resource requirements must be low. Ex- 
isting tools require dedicated CPU, memory, and disk 
hardware, making them expensive and difficult to inte- 
grate into the storage system. Fourth, the search inter- 
face must be flexible and easy to use. Metadata search 
enables complex file searches that are difficult to ask 
with existing file system interfaces and query languages. 
Fifth, search results must be secure; many existing sys- 
tems either ignore file ACLs or significantly degrade per- 
formance to enforce them. 

To address these issues, we developed Spyglass, a 
novel metadata search system that exploits file metadata 
properties to enable fast, scalable search that can be embedded
within the storage system. To guide our design,
we collected and analyzed file metadata snapshots from 
real-world storage systems at NetApp and conducted a 
survey of over 30 users and IT administrators. Our de- 
sign introduces several new metadata search techniques. 
Hierarchical partitioning is a new method of namespace-based
index partitioning that exploits namespace locality
to provide flexible control of the index. Signature
files are used to compactly describe a partition’s con- 
tents, helping to route queries only to relevant partitions 
and prune the search space to improve performance and 
scalability. A new snapshot-based metadata collection 
method provides scalable collection by re-crawling only 
the files that have changed. Finally, partition versioning, 
a novel index versioning mechanism, enables fast up- 
date performance while allowing “back-in-time” search 
of past metadata. Spyglass does not currently address 
search interface or security, which are left to future work. 

An evaluation of our Spyglass prototype, using our 
real-world, large-scale metadata traces, shows that 
search performance is improved 1-4 orders of magni- 
tude compared to basic DBMS setups. Additionally, 
search performance is scalable; it is capable of search- 
ing hundreds of millions of files in less than a second. 
Index update performance is up to 40x faster than basic 
DBMS setups and scales linearly with system size. The 
index itself typically requires less than 0.1% of total disk 
space. Index versioning allows “back-in-time” metadata 
search while adding only a tiny overhead to most queries. 
Finally, our snapshot-based metadata collection mecha- 
nism performs 10x faster than a straw-man approach. 
Our evaluation demonstrates that Spyglass can leverage 
file metadata properties to improve how files are man- 
aged in large-scale storage systems. 

The remainder of this paper is organized as follows.
Section 2 provides additional metadata search motivation 
and background. Section 3 presents the Spyglass design. 
Our prototype is evaluated in Section 4. Related work is 
discussed in Section 5, with future work and conclusions 
in Section 6. 


2 Background 


This section describes and motivates the use of file meta- 
data search and includes a discussion of real-world query 
and metadata characteristics. 


2.1 File Metadata 


File metadata, such as inode fields (e.g., size, owner,
timestamps, etc.), generated by the storage system and
extended attributes (e.g., document title, retention policy,
backup dates, etc.), generated by users and applications,
is typically represented as (attribute, value) pairs that
describe file properties. Today’s storage systems can
contain millions to billions of files, and each file can have
dozens of metadata attribute-value pairs, resulting in a
data set with 10^10 to 10^11 total pairs.

The ability to search file metadata facilitates complex 
queries on the properties of files in the storage system, 
helping administrators understand the kinds of files being 
stored, where they are located, how they are used, how 
they got there (provenance), and where they should be- 
long. For example, finding which files to migrate to tape 
may involve searching file size, access time, and owner 
metadata attributes, allowing administrators to decide on 
and enforce their management policies. Metadata search 
also helps users locate misplaced files, manage their stor- 
age space, and track file changes. As a result, metadata 
search tools are becoming more prevalent; recent reports 
state that 37% of enterprise businesses use such tools and 
40% plan to do so in the near future [12]. 

To better understand metadata search needs, we sur- 
veyed over 30 large scale storage system users and ad- 
ministrators. We found subjects using metadata search 
for a wide variety of purposes. Use cases included 
managing storage tiers, tracking legal compliance data, 
searching large scientific data output files, finding files 
with incorrect security ACLs, and resource/capacity 
planning. Table 1 provides examples of some popular 
use cases and the metadata attributes searched. 


2.2 Efficient Metadata Search 


Providing efficient metadata search in large-scale stor- 
age systems is a challenge. While a number of commer- 
cial file metadata search systems exist today [4, 14, 17, 
21, 30], these systems focus on smaller scales (e.g., up to 
tens of millions of files) and are often too slow, resource 
intensive, and expensive to be effective for large-scale 
systems. To be effective at large scales, file metadata 
search must provide the following: 

1) Minimal resource requirements. Metadata search 
should not require additional hardware. It should be em- 
bedded within the storage system and close to the files it 
indexes while not degrading system performance. Most 
existing systems require dedicated CPU, memory, and 
disk hardware, making them expensive and hard to de- 
ploy, and limiting their scalability. 

2) Fast metadata collection. Metadata changes must be 
periodically collected from millions to billions of files 
without exhausting or slowing the storage system. Ex- 
isting crawling methods are slow and can tax system re- 
sources. Hooks to notify systems of file changes can add 
overhead to important data paths. 

File Management Question                                      Metadata Search Query 
Which files can I migrate to tape?                            size > 50GB, atime > 6 months. 
How many duplicates of this file are in my home directory?    owner = john, datahash = 0xE431, path = /home/john. 
Where are my recently modified presentations?                 owner = john, type = (ppt | keynote), mtime < 2 days. 
Which legal compliance files can be expired?                  retention time = expired, mtime > 7 years 
Which of my files grew the most in the past week?             Top 100 where size(today) > size(1 week ago), owner = john. 
How much storage do these users and applications consume?     Sum size where owner = john, type = database 

Table 1: Use case examples. Metadata search use cases collected from our user survey. The high-level questions being addressed 
are on the left. On the right are the metadata attributes being searched and example values. Users used basic inode metadata, as 
well as specialized extended attributes, such as legal retention times. Common search characteristics include multiple attributes, 
localization to part of the namespace, and “back-in-time” search. 

3) Fast and scalable index search and update. Searches 
must be fast, even as the system grows, or usability may 
suffer. Updates must allow fast periodic re-indexing of 
metadata. However, existing systems typically rely on 
general-purpose relational databases (DBMSs) to index 
metadata. For example, Microsoft’s enterprise search in- 
dexes metadata in their Extensible Storage Engine (ESE) 
database [30]. Unfortunately, DBMSs often use heavy- 
weight locking and transactions that add overhead even 
when disabled [43]. Additionally, their designs make 
significant trade-offs between search and update perfor- 
mance [1]. DBMSs also assume abundant CPU, mem- 
ory, and disk resources. Although standard DBMSs have 
benefited from decades of performance research and op- 
timizations, such as vertical partitioning [23] and materi- 
alized views, their designs are not a perfect fit for meta- 
data search. This is not a new concept; the DBMS com- 
munity has argued that general-purpose DBMSs are not a 
“one size fits all solution” [9, 42, 43], instead saying that 
application-specific designs are often best. 


4) Easy to use search interface. Most systems export 
simple search APIs. However, recent research [3] has 
shown that specially designed interfaces that provide 
expressive and easy-to-use query capabilities can 
greatly improve the search experience. 


5) Secure search results. Search results must not allow 
users to find or access restricted files [10]. Existing sys- 
tems either ignore security or enforce it at a significant 
cost to performance. 


We designed Spyglass to address these challenges in 
large-scale storage systems. Spyglass is specially de- 
signed to exploit metadata search properties to achieve 
scale and performance while being embedded within the 
storage system. Spyglass focuses on crawling, updating, 
and searching metadata; interface and security designs 
are left to future work. 


2.3 Metadata Search Properties 


To understand metadata search properties, we analyzed 
results from our user survey and real-world metadata 
snapshot traces collected from storage servers at NetApp. 
We then used this analysis to guide our Spyglass design. 


Data Set   Description         # of Files    Capacity 
Web        web & wiki server   15 million    1.28 TB 
Eng        build space         60 million    30 GB 
Home       home directories    300 million   76.78 TB 

Table 2: Metadata traces collected. The small server capacity of 
the Eng trace is due to the majority of the files being small source 
code files: 99% of files are less than 1 KB. 

Attribute   Description         Attribute   Description 
inumber     inode number        owner       file owner 
path        full path name      size        file size 
ext         file extension      ctime       change time 
type        file or directory   atime       access time 
mtime       modification time               hard link # 

Table 3: Attributes used. We analyzed the fields in the inode 
structure and extracted ext values from path. 


Search Characteristics. From our survey, we observed 
three important metadata search characteristics. First, 
over 95% of searches included multiple metadata at- 
tributes to refine search results; a search on a single at- 
tribute over a large file system can return thousands or 
even millions of results, which users do not want to sift 
through. Second, about 33% of searches were localized 
to part of the namespace, such as a home or project di- 
rectory. Users often have some idea of where their files 
are and a strong idea of where they are not; localizing 
the search focuses results on only relevant parts of the 
namespace. Third, about 25% of the searches that users 
deemed most important searched multiple versions of 
metadata. Users use “back-in-time” searches to under- 
stand file trends and how files are accessed. 
Metadata Characteristics. We collected metadata 
snapshot traces from three storage servers at NetApp. 
Our traces—Web, Eng, and Home—are described in Ta- 
ble 2. Table 3 describes the metadata attributes that we 
analyzed. NetApp servers support extended attributes, 
though they were rarely used in these traces. We found 
two key properties in these traces: metadata has spatial 
locality and highly skewed distributions of values. 

(a) Locality Ratio = 54%    (b) Locality Ratio = 38% 

Figure 1: Examples of locality ratio. Directories that recur- 
sively contain the ext attribute value html are black and gray. 
The black directories contain the value. The locality ratio of ext 
value html is 54% (= 7/13) in the first tree and 38% (= 5/13) 
in the second tree. The value of html has better spatial locality 
in the second tree than in the first one. 

Spatial locality means that attribute values are clus- 
tered in the namespace (i.e., occurring in relatively few 
directories). For example, john’s files reside mostly in 
the /home/john sub-tree, not scattered evenly across 
the namespace. Spatial locality comes from the way that 
users and applications organize files in the namespace, 
and has been noted in other file system studies [2, 25]. 
To measure spatial locality, we use an attribute value’s 
locality ratio: the percent of directories that recursively 
contain the value, as illustrated in Figure 1. A directory 
recursively contains an attribute value if it or any of its 
sub-directories contains the value. The figure on the right 
has a lower locality ratio because the ext attribute value 
html is recursively contained in fewer directories. Ta- 
ble 4 shows the locality ratios for the 32 most frequently 
occurring values for various attributes (ext, size, owner, 
ctime, mtime) in each trace. Locality ratios are less than 
1% for all attributes, meaning that 99% of directories do 
not recursively contain the value. We expect extended at- 
tributes to exhibit similar properties since they are often 
tied to file type and owner attributes. 
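
The locality ratio defined above can be computed directly from a list of files and directories. The following is a minimal sketch, assuming metadata is available as (path, attribute dictionary) pairs; the function name and input layout are illustrative and are not part of Spyglass. 

```python
from pathlib import PurePosixPath


def locality_ratio(files, all_dirs, attr, value):
    """Percent of directories that recursively contain `value` for `attr`.

    files:    iterable of (path, metadata dict) pairs (assumed layout)
    all_dirs: collection of every directory path in the namespace
    """
    containing = set()
    for path, meta in files:
        if meta.get(attr) != value:
            continue
        # A directory recursively contains the value if it or any of its
        # sub-directories holds a matching file, so mark every ancestor.
        for parent in PurePosixPath(path).parents:
            containing.add(str(parent))
    return 100.0 * len(containing & set(all_dirs)) / len(all_dirs)


dirs = {"/", "/home", "/home/john", "/home/jim", "/var"}
files = [
    ("/home/john/a.html", {"ext": "html"}),
    ("/home/jim/b.txt", {"ext": "txt"}),
]
ratio = locality_ratio(files, dirs, "ext", "html")
```

In this toy namespace three of the five directories recursively contain the value, so the ratio is far higher than the sub-1% ratios observed in the traces. 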

Utilizing spatial locality can help prune a query’s 
search space by identifying only the parts of the names- 
pace that contain a metadata value, eliminating a large 
number of files to search. Unfortunately, most general- 
purpose DBMSs treat path names as flat string attributes, 
making it difficult for them to utilize this information, in- 
stead typically requiring them to consider all files for a 
search no matter its locality. 

Metadata values also have highly skewed 
frequencies—their popularity distributions are asym- 
metric, causing a few very popular metadata values to 
account for a large fraction of all total values. This 
distribution has also been observed in other metadata 
studies [2, 11]. Figures 2(a) and 2(b) show the distri- 
bution of ext and size values from our Home trace on 
a log-log scale. The linear appearance indicates that 
the distributions are Zipf-like and follow a power-law 
distribution [40]. In these distributions, 80% of files 
have one of the 20 most popular ext or size values, 
while the remaining 20% of the files have thousands of 
other values. Figure 2(c) shows the distribution of the 
Cartesian product (i.e., the intersection) of the top 20 
ext and size values. The curve is much flatter, which 
indicates a more even distribution of values. Only 33% 
of files have one of the top 20 ext and size combinations. 
In Figure 2(c), file percentages for corresponding ranks 
are at least an order of magnitude lower than in the 
other two graphs. This means, for example, that there 
are many files with owner john and many files with 
ext pdf, but there are often over an order of magnitude 
fewer files with both owner john and ext pdf. 

These distribution properties show that multi-attribute 
searches will significantly reduce the number of query 
results. Unfortunately, most DBMSs rely on attribute 
value distributions (also known as selectivity) to choose 
a query plan. When distributions are skewed, query 
plans often require extra data processing [28]; for ex- 
ample, they may retrieve all of john’s files to find the 
few that are john’s pdf files or vice-versa. Our anal- 
ysis shows that query execution should utilize attribute 
values’ spatial locality rather than their frequency distri- 
butions. Spatial locality provides a more effective way 
to execute a query because it is more selective and can 
better reduce a query’s search space. 


3 Spyglass Design 


Spyglass uses several novel techniques that exploit the 
metadata search properties discussed in Section 2 to pro- 
vide fast, scalable search in large-scale storage systems. 
First, hierarchical partitioning partitions the index based 
on the namespace, preserving spatial locality in the index 
and allowing fine-grained index control. Second, signa- 
ture files [13] are used to improve search performance by 
leveraging locality to identify only the partitions that are 
relevant to a query. Third, partition versioning versions 
index updates, which improves update performance and 
allows “back-in-time” search of past metadata versions. 
Finally, Spyglass utilizes storage system snapshots to 
crawl only the files whose metadata has changed, pro- 
viding fast collection of metadata changes. Spyglass re- 
sides within the storage system and consists of two ma- 
jor components, shown in Figure 3: the Spyglass index, 
which stores metadata and serves queries, and a crawler 
that extracts metadata from the storage system. 


3.1 Hierarchical Partitioning 


To exploit metadata locality and improve scalability, the 
Spyglass index is partitioned into a collection of separate, 
smaller indexes, which we call hierarchical partitioning. 
Hierarchical partitioning is based on the storage system’s 
namespace and encapsulates separate parts of the names- 
pace into separate partitions, thus allowing more flexible, 
finer grained control of the index. Similar partitioning 
strategies are often used by file systems to distribute the 
namespace across multiple machines [35, 44]. 

Trace  ext                   size                 owner                 ctime                 mtime 
Web    0.000162% – 0.120%    0.0579% – 0.177%     0.000194% – 0.0558%   0.000291% – 0.0105%   0.000388% – 0.00720% 
Eng    0.00101% – 0.264%     0.00194% – 0.462%    0.000578% – 0.137%    0.000453% – 0.0103%   0.000528% – 0.0578% 
Home   0.000201% – 0.491%    0.0259% – 0.923%     0.000417% – 0.623%    0.000370% – 0.128%    0.000911% – 0.0103% 

Table 4: Locality ratios of the 32 most frequently occurring attribute values. All locality ratios are well below 1%, which means 
that files with these attribute values are recursively contained in less than 1% of directories. 

Figure 2: Attribute value distribution examples. A rank of 1 represents the attribute value with the highest file count. The linear 
curves on the log-log scales in Figures 2(a) and 2(b) indicate a Zipf-like distribution, while the flatter curve in Figure 2(c) indicates 
a more even distribution. 

Figure 3: Spyglass overview. Spyglass resides within the stor- 
age system. The crawler extracts file metadata, which gets 
stored in the index. The index consists of a number of partitions 
and versions, all of which are managed by a caching system. 


Each of the Spyglass partitions is stored sequentially 
on disk, as shown in Figure 4. Thus, unlike a DBMS, 
which stores records adjacently on disk using their row 
or column order, Spyglass groups records nearby in the 
namespace together on disk. This approach improves 
performance since the files that satisfy a query are often 
clustered in only a portion of the namespace, as shown 
by our observations in Section 2. For example, a search 
of the storage system for john’s .ppt files likely does 
not require searching sub-trees such as other users’ home 
directories or system file directories. Hierarchical parti- 
tioning allows only the sub-trees relevant to a search to 
be considered, thereby enabling reduction of the search 
space and improving scalability. Also, a user may choose 
to localize the search to only a portion of the names- 
pace. Hierarchical partitioning allows users to control 
the scope of the files that are searched. A DBMS-based 
solution usually encodes pathnames as flat strings, mak- 
ing it oblivious to the hierarchical nature of file organiza- 
tion and requiring it to consider the entire namespace for 
each search. If the DBMS stores the files sorted by file 
name, it can improve locality and reduce the fraction of 
the index table that must be scanned; however, this ap- 
proach can still result in performance problems for index 
updates, and does not encapsulate the hierarchical rela- 
tionship between files. 

Spyglass partitions are kept small, on the order of 
100,000 files, to maintain locality in the partition and to 
ensure that each can be read and searched very quickly. 
Since partitions are stored sequentially on disk, searches 
can usually be satisfied with only a few small sequential 
disk reads. Also, sub-trees often grow at a slower rate 
than the system as a whole [2, 25], which provides scal- 
ability because the number of partitions to search will 
often grow slower than the size of the system. 

We refer to each partition as a sub-tree partition; the 
Spyglass index is a tree of sub-tree partitions that reflects 
the hierarchical ordering of the storage namespace. Each 
partition has a main partition index, in which file meta- 
data for the partition is stored; partition metadata, which 
keeps information about the partition; and pointers to 
child partitions. Partition metadata includes information 
used to determine if a partition is relevant to a search and 
information used to support partition versioning. 
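
The tree of sub-tree partitions can be pictured with a small sketch. The class and function names below are hypothetical, the record list merely stands in for the main partition index, and the sketch assumes a parent partition also covers any files not claimed by a child partition. It shows only how a search localized to part of the namespace skips unrelated partitions. 

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    root: str                                     # sub-tree path indexed here
    records: list = field(default_factory=list)   # stand-in for the partition index
    children: list = field(default_factory=list)  # pointers to child partitions


def is_under(path, root):
    """True if `path` equals `root` or lies below it in the namespace."""
    root = root.rstrip("/")
    return path == root or path.startswith(root + "/")


def partitions_for_scope(part, scope):
    """Yield the partitions whose sub-tree overlaps the query scope."""
    if not (is_under(part.root, scope) or is_under(scope, part.root)):
        return  # disjoint sub-tree: neither it nor its children are read
    yield part
    for child in part.children:
        yield from partitions_for_scope(child, scope)


index = Partition("/", children=[
    Partition("/home/john"),
    Partition("/home/jim"),
    Partition("/usr"),
])
hits = [p.root for p in partitions_for_scope(index, "/home/john")]
```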

The Spyglass index is stored persistently on disk; how- 
ever, all partition metadata, which is small, is cached 
in-memory. A partition cache manages the movement 
of entire partition indexes to and from disk as needed. 
When a file is accessed, its neighbor files will likely need 
to be accessed as well, due to spatial locality. Paging en- 
tire partition indexes allows metadata for all of these files 
to be fetched in a single, small sequential read. This con- 
cept is similar to the use of embedded inodes [15], which 
store inodes adjacent to their parent directory on disk. 

Figure 4: Hierarchical partitioning example. Sub-tree parti- 
tions, shown in different colors, index different storage system 
sub-trees. Each partition is stored sequentially on disk. The 
Spyglass index is a tree of sub-tree partitions. 

In general, Spyglass search performance is a function 
of the number of partitions that must be read from disk. 
Thus, the partition cache’s goal is to reduce disk accesses 
by ensuring that most partitions searched are already in- 
memory. While we know of no studies of file system 
query patterns, we believe that a simple LRU algorithm 
is effective. Both web queries [5] and file system ac- 
cess patterns [25] exhibit skewed, Zipf-like popularity 
distributions, suggesting that file metadata queries may 
exhibit similar popularity distributions; this would mean 
that only a small subset of partitions will be frequently 
accessed. An LRU algorithm keeps frequently accessed 
partitions in-memory, ensuring high performance for 
common queries and efficient cache utilization. 
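
A partition cache with LRU replacement might look like the following sketch. The class name, the capacity parameter, and the `load_from_disk` callable are assumptions made for illustration; the prototype's actual cache is not described at this level of detail. 

```python
from collections import OrderedDict


class PartitionCache:
    """Keeps whole partition indexes in memory with LRU replacement."""

    def __init__(self, capacity, load_from_disk):
        self.capacity = capacity
        self.load_from_disk = load_from_disk  # hypothetical: partition id -> index
        self.cached = OrderedDict()           # least recently used entry first
        self.disk_reads = 0

    def get(self, pid):
        if pid in self.cached:
            self.cached.move_to_end(pid)      # mark as most recently used
            return self.cached[pid]
        self.disk_reads += 1                  # miss: page in the entire partition
        index = self.load_from_disk(pid)
        self.cached[pid] = index
        if len(self.cached) > self.capacity:
            self.cached.popitem(last=False)   # evict the least recently used
        return index


cache = PartitionCache(2, lambda pid: f"index-{pid}")
for pid in ["a", "b", "a", "c", "a", "b"]:
    cache.get(pid)
```

With a skewed access pattern, the popular partition stays resident while rarely used ones are evicted. 
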
Partition Indexes. Each partition index must provide 
fast, multi-dimensional search of the metadata it in- 
dexes. To do this we use a K-D tree [7], which is a k- 
dimensional binary tree, because it provides lightweight, 
logarithmic point, range, and nearest neighbor search 
over k dimensions and allows multi-dimensional search 
of a partition in tens to hundreds of microseconds. 
Other index structures can provide additional 
functionality. For example, FastBit [45] provides high 
index compression, Berkeley DB [34] provides trans- 
actional storage, cache-oblivious B-trees [6] improve 
update time, and K-D-B-trees [38] allow partially in- 
memory K-D trees. In most cases, however, the fast, 
lightweight nature of K-D trees is preferred. The draw- 
back is that K-D trees are difficult to update; Section 3.2 
describes techniques to avoid continuous updates. 
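
To illustrate multi-dimensional search over a partition, the sketch below builds a small in-memory K-D tree and runs a range query over two attributes. It is a textbook construction under assumed inputs (tuples of numeric attribute values), not the Spyglass implementation. 

```python
def build_kdtree(points, depth=0):
    """Build a K-D tree from equal-length tuples, e.g. (size, mtime)."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def range_search(node, lo, hi, out=None):
    """Collect points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    if out is None:
        out = []
    if node is None:
        return out
    p, axis = node["point"], node["axis"]
    if all(a <= v <= b for a, v, b in zip(lo, p, hi)):
        out.append(p)
    if lo[axis] <= p[axis]:                  # left subtree holds values <= p[axis]
        range_search(node["left"], lo, hi, out)
    if p[axis] <= hi[axis]:                  # right subtree holds values >= p[axis]
        range_search(node["right"], lo, hi, out)
    return out


# Hypothetical (size in KB, age in days) records for one partition.
records = [(10, 1), (500, 3), (60, 7), (2000, 2), (75, 9)]
tree = build_kdtree(records)
matches = sorted(range_search(tree, (50, 0), (1000, 8)))
```
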
Partition Metadata. Partition metadata contains infor- 
mation about the files in the partition, including paths 
of indexed sub-trees, file statistics, signature files, and 
version information. File statistics, such as file counts 
and minimum and maximum values, are kept to answer 
aggregation and trend queries without having to process 
the entire partition index. These statistics are computed 
as files are being indexed. A version vector, which is de- 
scribed in Section 3.2, manages partition versions. Sig- 
nature files are used to determine if the partition contains 
files relevant to a query. 

Signature files [13] are bit arrays that serve as compact 
summaries of a partition’s contents and exploit metadata 
locality to prune a query’s search space. A common ex- 
ample of a signature file is the Bloom Filter [8]. Spy- 
glass can determine whether a partition may index any 
files that match a query simply by testing bits in the sig- 
nature files. A signature file and an associated hashing 
function are created for each attribute indexed in the par- 
tition. All bits in the signature file are initially set to zero. 
As files are indexed, their attribute values are hashed to 
a bit position in the attribute’s signature file, which is set 
to one. To determine if the partition indexes files relevant 
to a query, each attribute value being searched is hashed 
and its bit position is tested. The partition needs to be 
searched only if all bits tested are set to one. Due to spa- 
tial locality, most searches can eliminate many partitions, 
reducing the number of disk accesses and processing a 
query must perform. 
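
The signature file test described above can be sketched as one bit array and one hash function per attribute. The class names, the CRC32 hash, and the default array size are assumptions chosen for illustration. 

```python
import zlib


class SignatureFile:
    """A bit array with a single hash function, for one attribute."""

    def __init__(self, num_bits=2048 * 8):    # about 2 KB; the size is a tunable
        self.num_bits = num_bits
        self.bits = bytearray(num_bits // 8)  # all bits start at zero

    def _position(self, value):
        # CRC32 is an arbitrary stand-in for the attribute's hashing function.
        return zlib.crc32(repr(value).encode()) % self.num_bits

    def add(self, value):
        pos = self._position(value)
        self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, value):
        pos = self._position(value)
        return bool(self.bits[pos // 8] & (1 << (pos % 8)))


class PartitionSignatures:
    def __init__(self, attributes):
        self.sigs = {attr: SignatureFile() for attr in attributes}

    def index_file(self, meta):
        for attr, sig in self.sigs.items():
            if attr in meta:
                sig.add(meta[attr])

    def is_relevant(self, query):
        # The partition is searched only if every tested bit is set.
        return all(self.sigs[a].may_contain(v) for a, v in query.items())


part = PartitionSignatures(["owner", "ext"])
part.index_file({"owner": "john", "ext": "ppt"})
```

A set bit only means the partition may hold a match, so a positive answer still requires searching the partition index itself. 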

Due to collisions in the hashing function that cause 
false positives, a signature file determines only if a par- 
tition may contain files relevant to a query, potentially 
causing a partition to be searched when it does not con- 
tain any files relevant to a search. However, signature 
files cannot produce false negatives, so partitions with 
relevant files will never be missed. False-positive rates 
can be reduced by varying the size of the signature or 
changing the hashing function. Increasing signature file 
sizes, which are initially around 2KB, decreases the 
chances of a collision by increasing the total number of 
bits. This trades increased memory requirements for 
lower false-positive rates. Changing the hashing function 
allows a bit’s meaning and how it is used to be improved. 
For example, consider a signature file for file size at- 
tributes. Rather than have each bit represent a single size 
value (e.g., 522 bytes), we can reduce false positives for 
common small files by mapping each 1 KB range to a 
single bit for sizes under 1 MB. The ranges for less com- 
mon large files can be made more coarse, perhaps using 
a single bit for sizes between 25 and 50 MB. 
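
A range-based hashing function for the size attribute could be written as follows. The 1 KB granularity below 1 MB and the 25 to 50 MB range come from the text; the remaining coarse boundaries are assumed values picked only to complete the example. 

```python
import bisect

KB, MB = 1024, 1024 * 1024

# Assumed upper bounds (in MB) of the coarse ranges used for large files.
LARGE_BOUNDS_MB = [5, 10, 25, 50, 100, 1024]


def size_bit(size_bytes):
    """Map a file size to a bit position in the size signature file."""
    if size_bytes < MB:
        return size_bytes // KB   # bits 0..1023: one bit per 1 KB range
    # Larger files share one bit per coarse range, e.g. 25 MB up to 50 MB.
    return 1024 + bisect.bisect_right(LARGE_BOUNDS_MB, size_bytes / MB)
```

Every size in the same range sets and tests the same bit, so a query for a small-file size range no longer collides with unrelated sizes through an arbitrary hash. 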

The number of signature files that have to be tested can 
be reduced by utilizing the tree structure of the Spyglass 
index to create hierarchically defined signature files. Hi- 
erarchical signature files are smaller signatures (roughly 
100 bytes) that summarize the contents of a partition 
and the partitions below it in the tree. Hierarchical signa- 
ture files are the logical OR of a partition’s signature files 
and the signature files of its children. A single failed test 
of a hierarchical signature file can eliminate huge parts of 
the index from the search space, preventing every parti- 
tion’s signature files from being tested. Hierarchical sig- 
nature files are kept small to save memory at the cost of 
increased false positives. 

Figure 5: Versioning partitioning example. Each sub-tree 
partition manages its own versions. A baseline index is a nor- 
mal partition index from some initial time T_0. Each incremental 
index contains the changes required to roll query results forward 
to a new point in time. Each sub-tree partition manages its ver- 
sions in a version vector. 
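
The pruning effect of hierarchical signature files can be sketched with integers used as bitmaps. The node class, the CRC32 hash, and the use of a single attribute are simplifying assumptions; the sketch shows only that a parent's signature is the OR of its own and its children's, and that one failed test skips a whole sub-tree. 

```python
import zlib

HIER_BITS = 100 * 8   # roughly 100 bytes per hierarchical signature


def bit_for(value):
    # CRC32 is an arbitrary stand-in for the hashing function.
    return 1 << (zlib.crc32(repr(value).encode()) % HIER_BITS)


class Node:
    def __init__(self, name, values, children=()):
        self.name = name
        self.children = list(children)
        self.own = 0
        for v in values:
            self.own |= bit_for(v)
        # Hierarchical signature: logical OR of this partition's bits
        # and the hierarchical signatures of its children.
        self.hier = self.own
        for child in self.children:
            self.hier |= child.hier


def candidate_partitions(node, value):
    """Names of partitions that may hold `value`, pruning failed sub-trees."""
    bit = bit_for(value)
    if not node.hier & bit:
        return []                 # nothing below this node can match
    found = [node.name] if node.own & bit else []
    for child in node.children:
        found += candidate_partitions(child, value)
    return found


john = Node("/home/john", ["ppt", "pdf"])
jim = Node("/home/jim", ["c"])
home = Node("/home", [], [john, jim])
root = Node("/", [], [home, Node("/usr", ["so"])])
candidates = candidate_partitions(root, "ppt")
```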


3.2 Partition Versioning 


Spyglass improves update performance and enables 
“back-in-time” search using a technique called parti- 
tion versioning that batches index updates, treating each 
batch as a new incremental index version. The motiva- 
tion for partition versioning is two-fold. First, we wish 
to improve index update performance by not having to 
modify existing index structures. In-place modification 
of existing indexes can generate large numbers of disk 
seeks and can cause partition index structures to become 
unbalanced. Second, back-in-time search can help an- 
swer many important storage management questions that 
can track file trends and how they change. 

Spyglass batches updates before they are applied as 
new versions to the index, meaning that the index may 
be stale because file modifications are not immediately 
reflected in the index. However, batching updates im- 
proves index update performance by eliminating many 
small, random, and frequent updates that can thrash the 
index and cache. Additionally, from our user survey, 
most queries can be satisfied with a slightly stale index. 
It should be noted that partition versioning does not re- 
quire updates to be batched. The index can be updated in 
real time by versioning each individual file modification, 
as is done in most versioning file systems [39, 41]. 
Creating Versions. Spyglass versions each sub-tree par- 
tition individually rather than the entire index as a whole 
in order to maintain locality. A versioned sub-tree par- 
tition consists of two components: a baseline index and 


incremental indexes, which are illustrated in Figure 5. A 
baseline index is a normal partition index that represents 
the state of the storage system at time T_0, or the time of 
the initial update. An incremental index is an index of 
metadata changes between two points in time T_{n-1} and 
T_n. These changes are indexed in K-D trees, and smaller 
signature files are created for each incremental index. 
Storing changes differs from the approach used in some 
versioning file systems [39], which maintain full copies 
for each version. Changes consist of metadata creations, 
deletions, and modifications. Maintaining only changes 
requires a minimal amount of storage overhead, resulting 
in a smaller footprint and less data to read from disk. 


Each sub-tree partition starts with a baseline index, as 
shown in Figure 5. When a batch of metadata changes 
is received at T_1, it is used to build incremental indexes. 
Each partition manages its incremental indexes using a 
version vector, similar in concept to inode logs in the 
Elephant File System [39]. Since file metadata in differ- 
ent parts of the file system change at different rates [2, 
25], partitions may have different numbers and sizes of 
incremental indexes. Incremental indexes are stored se- 
quentially on disk adjacent to their baseline index. As a 
result, updates are fast because each partition writes its 
changes in a single, sequential disk access. Incremen- 
tal indexes are paged into memory whenever the base- 
line index is accessed, increasing the amount of data that 
must be read when paging in a partition, though not typi- 
cally increasing the number of disk seeks. As a result, the 
overhead of versioning on overall search performance is 
usually small. 


Performing a “back-in-time” search that is accurate as 
of time T_n works as follows. First, the baseline index 
is searched, producing query results that are accurate as 
of T_0. The incremental indexes T_1 through T_n are then 
searched in chronological order. Each incremental in- 
dex searched produces metadata changes that modify the 
search results, rolling them forward in time, and even- 
tually generating results that are accurate as of T_n. For 
example, consider a query for files with owner john 
that matches two files, F_a and F_b, at T_0. A search of in- 
cremental indexes at T_1 may yield changes that cause F_b 
to no longer match the query (e.g., no longer owned by 
john), and a later search of incremental indexes at T_n 
may yield changes that cause file F_c to match the query 
(i.e., now owned by john). The results of the query are 
F_a and F_c, which are accurate as of T_n. Because this pro- 
cess is done in memory and each version is relatively 
small, searching through incremental indexes is often 
very fast. In rolling results forward, a small penalty is 
paid to search the most recent changes; however, updates 
are much faster because no data needs to be copied, as 
is the case in CVFS [41], which rolls version changes 
backwards rather than forwards. 
Managing Versions. Over time, older versions tend 
to decrease in value and should be removed to re- 
duce search overhead and save space. Spyglass pro- 
vides two efficient techniques for managing partition 
versions: version collapsing and version checkpointing. 
Version collapsing applies each partition’s incremental 
index changes to its baseline index. The result is a single 
baseline for each partition that is accurate as of the most 
recent incremental index. Collapsing is efficient because 
all original index data is read sequentially and the new 
baseline is written sequentially. Version checkpointing 
allows an index to be saved to disk as a new copy to pre- 
serve an important landmark version of the index. 

We describe how collapsing and checkpointing can be 
used with an example. During the day, Spyglass is up- 
dated hourly, creating new versions every hour, thus al- 
lowing “back-in-time” searches to be performed at per- 
hour granularity over the day. At the end of each day, 
incremental versions are collapsed, reducing space over- 
head at the cost of prohibiting hour-by-hour searching 
over the last day. Also, at the end of each day, a copy 
of the collapsed index is checkpointed to disk, represent- 
ing the storage system state at the end of each day. At 
the end of each week, all but the latest daily checkpoints 
are deleted; and at the end of each month, all but the lat- 
est weekly checkpoints are deleted. This results in ver- 
sions of varying time scales. For example, over the past 
day any hour can be searched, over the past week any 
day can be searched, and over the past month any week 
can be searched. The frequency for index collapsing and 
checkpointing can be configured based on user needs and 
space constraints. 
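
The roll-forward search and version collapsing of Section 3.2 can be sketched for a single partition as follows. Dictionaries stand in for the K-D tree indexes, each change is assumed to carry a file's full new metadata (or None for a deletion), and all names are hypothetical. 

```python
class VersionedPartition:
    def __init__(self, baseline):
        self.baseline = dict(baseline)  # path -> metadata as of the baseline
        self.increments = []            # version vector: (time, changes), oldest first

    def add_version(self, time, changes):
        """changes: path -> new metadata dict, or None for a deletion."""
        self.increments.append((time, dict(changes)))

    def search(self, predicate, as_of):
        # Search the baseline, then roll the results forward in time.
        results = {p for p, m in self.baseline.items() if predicate(m)}
        for time, changes in self.increments:
            if time > as_of:
                break
            for path, meta in changes.items():
                if meta is not None and predicate(meta):
                    results.add(path)
                else:
                    results.discard(path)
        return results

    def collapse(self):
        # Apply every incremental index to the baseline, leaving one
        # baseline that is accurate as of the most recent version.
        for _, changes in self.increments:
            for path, meta in changes.items():
                if meta is None:
                    self.baseline.pop(path, None)
                else:
                    self.baseline[path] = meta
        self.increments = []


part = VersionedPartition({
    "a": {"owner": "john"}, "b": {"owner": "john"}, "c": {"owner": "jim"},
})
part.add_version(1, {"b": {"owner": "mary"}})
part.add_version(2, {"c": {"owner": "john"}})
owned_by_john = lambda meta: meta["owner"] == "john"
```

Searching as of time 0, 1, or 2 returns different result sets from the same structure; after collapsing, only the most recent state remains searchable. 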


3.3 Collecting Metadata Changes 


The Spyglass crawler takes advantage of NetApp 
Snapshot™ technology in the NetApp WAFL® file 
system [19] on which it was developed to quickly collect 
metadata changes. Given two snapshots, T_{n-1} and T_n, 
Spyglass calculates the difference between them. This 
difference represents all of the file metadata changes be- 
tween T_{n-1} and T_n. Because of the way snapshots are 
created, only the metadata of changed files is re-crawled. 

All metadata in WAFL resides in a single file called 
the inode file, which is a collection of fixed length inodes. 
Extended attributes are included in the inodes. Perform- 
ing an initial crawl of the storage system is fast because 
it simply involves sequentially reading the inode file. 
Snapshots are created by making a copy-on-write clone 
of the inode file. Calculating the difference between two 
snapshots leverages this mechanism. This is shown in 
Figure 6. By looking at the block numbers of the inode 
file’s indirect and data blocks, we can determine exactly 
which blocks have changed. If a block’s number has not
changed, then it does not need to be crawled. If this block
is an indirect block, then no blocks that it points to need
to be crawled either because block changes will propagate
all the way back up to the inode file’s root block. As
a result, the Spyglass crawler can identify just the data
blocks that have changed and crawl only their data. This
approach greatly enhances scalability because crawl performance
is a function of the number of files that have
changed rather than the total number of files.

Figure 6: Snapshot-based metadata collection. In snapshot 2,
block 7 has changed since snapshot 1. This change is propagated
up the tree. Because block 2 has not changed, we do not
need to examine it or any blocks below it.
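
The block-number comparison described in this section can be sketched as a recursive walk over two copy-on-write trees. This is a minimal illustration, assuming a hypothetical in-memory block store (`blocks`) in which each block is either an indirect block listing child block numbers or a data block, and assuming both snapshots have the same tree shape. The block numbers are invented and only loosely follow Figure 6; none of this is WAFL's actual on-disk format or API.

```python
def changed_data_blocks(blocks, old_root, new_root):
    """Return data block numbers reachable from new_root that differ from old_root."""
    if old_root == new_root:
        return []                      # same block number: nothing below changed
    kind, content = blocks[new_root]
    if kind == "data":
        return [new_root]              # a rewritten data block must be crawled
    old_children = blocks[old_root][1] if old_root is not None else []
    changed = []
    for i, child in enumerate(content):
        old_child = old_children[i] if i < len(old_children) else None
        changed += changed_data_blocks(blocks, old_child, child)
    return changed

# Snapshot 1 is rooted at block 1. In snapshot 2, data block 7 was rewritten
# as block 10, so its parent (3 -> 9) and the root (1 -> 8) were rewritten too.
blocks = {
    1: ("indirect", [2, 3]),
    2: ("indirect", [4, 5]),
    3: ("indirect", [6, 7]),
    4: ("data", "inodes A"), 5: ("data", "inodes B"),
    6: ("data", "inodes C"), 7: ("data", "inodes D"),
    8: ("indirect", [2, 9]),
    9: ("indirect", [6, 10]),
    10: ("data", "inodes D'"),
}
to_crawl = changed_data_blocks(blocks, 1, 8)
```

Here the subtree under block 2 is never visited because its block number is identical in both snapshots, so the work done is proportional to the changed portion of the tree.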


Spyglass is not dependent on snapshot-based crawl- 
ing, though it provides benefits compared to alterna- 
tive approaches. Periodically walking the file system 
can be extremely slow because each file must be tra- 
versed. Moreover, traversal can utilize significant sys- 
tem resources and alter file access times on which file 
caches depend. Another approach, file system event notifications
(e.g., inotify [22]), requires hooks into critical
code paths, potentially impacting performance. A
changelog, such as the one used in NTFS, is another al- 
ternative; however, since we are not interested in every 
system event, a snapshot-based scheme is more efficient. 


3.4 Distributed Design 


Our discussion thus far has focused on indexing and 
crawling on a single storage server. However, large-scale 
storage systems are often composed of tens or hundreds 
of servers. While we do not currently address how to dis- 
tribute the index, we believe that hierarchical partitioning 
lends itself well to a distributed environment because the 
Spyglass index is a tree of partitions. A distributed file 
system with a single namespace can view Spyglass as 
a larger tree composed of partitions placed on multiple 
servers. As a result, distributing the index is a matter of 
effectively scaling the Spyglass index tree. Also, the use 
of signature files may be effective at routing distributed 
queries to relevant servers and their sub-trees. Obviously, 
there are many challenges to actually implementing this. 
A complete design is left to future work.


4 Experimental Evaluation


We evaluated our Spyglass prototype to determine how 
well our design addresses the metadata search challenges 
described in Section 2 for varying storage system sizes. 
To do this, we first measured metadata collection speed, 
index update performance, and disk space usage. We 
then analyzed search performance and how effectively 
index locality is utilized. Finally, we measured partition 
versioning overhead. 

Implementation Details. Our Spyglass prototype was 
implemented as a user-space process on Linux. An RPC- 
based interface to WAFL gathers metadata changes using 
our snapshot-based crawler. Our prototype dynamically 
partitions the index as it is being updated. As files and 
directories are inserted into the index, they are placed 
into the partition with the longest pathname match (i.e., 
the pathname match farthest down the tree). New par- 
titions are created when a directory is inserted and all 
matching partitions are full. A partition is considered 
full when it contains over 100,000 files. We use 100,000 
as the soft partition limit in order to ensure that parti- 
tions are small enough to be efficiently read and written 
to disk. Using a much smaller partition size will often 
increase the number of partitions that must be accessed 
for a query; this incurs extra expensive disk seeks. Us- 
ing a much larger partition size decreases the number of 
partitions that must be accessed for a query; however it 
poorly encapsulates spatial locality, causing extra data to 
be read from disk. In the case of symbolic and hard links, 
multiple index entries are used for the file. 
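
The placement rule above can be sketched as follows. This is one possible reading, not the prototype's code: the class name `PartitionedIndex` is hypothetical, and the sketch assumes that an entry goes to the longest matching partition that is not yet full, that a directory starts a new partition only when every matching partition is full, and that a file arriving when all matches are full is still placed in the longest match (the limit is soft). Paths are assumed to be absolute.

```python
SOFT_LIMIT = 100_000   # soft partition limit stated in the text

class PartitionedIndex:
    def __init__(self, limit=SOFT_LIMIT):
        self.limit = limit
        self.partitions = {"/": []}    # partition root path -> entries

    def _matches(self, path):
        """Partition roots that are pathname prefixes of path, longest first."""
        roots = [r for r in self.partitions
                 if r == "/" or path == r or path.startswith(r + "/")]
        return sorted(roots, key=len, reverse=True)

    def insert(self, path, is_dir=False):
        matches = self._matches(path)
        for root in matches:
            # A partition is full once it holds more than `limit` entries.
            if len(self.partitions[root]) <= self.limit:
                self.partitions[root].append(path)
                return root
        if is_dir:
            self.partitions[path] = [path]   # all matches full: new partition
            return path
        self.partitions[matches[0]].append(path)
        return matches[0]

index = PartitionedIndex(limit=2)            # tiny limit for illustration
placed = [
    index.insert("/home", is_dir=True),
    index.insert("/home/a.txt"),
    index.insert("/home/b.txt"),
    index.insert("/home/john", is_dir=True), # "/" is now full
    index.insert("/home/john/vm.vmdk"),
    index.insert("/home/c.txt"),
]
```

Because new partitions are rooted at directories, files that are close in the namespace tend to land in the same partition, which is the spatial locality the partition size trade-off above is concerned with.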

During the update process, partitions are buffered in- 
memory and written sequentially to disk when full; each 
is stored in a separate file. K-D trees were implemented 
using libkdtree++ [27]. Signature file bit-arrays 
are about 2 KB, but hierarchical signature files are only 
100 bytes, ensuring that signature files can fit within 
our memory constraints. Hashing functions that allowed 
each signature file’s bit to correspond to a range of values 
were used for file size and time attributes to reduce false 
positive rates. When incremental indexes are created, 
they are appended to their partition on disk. Finally, we 
implement a simple search API that allows point, range, 
top-k, and aggregation searches. We plan to extend this 
interface as future work. 
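
The idea of mapping each signature bit to a range of values can be sketched as below. This is a simplified illustration, not the prototype's hashing scheme: the class `SignatureFile`, the CRC-based hash for extensions, and the power-of-two size buckets are assumptions chosen for the example. The paper states only that range-based hash functions were used for size and time attributes.

```python
import zlib

class SignatureFile:
    """A bit array summarizing the attribute values present in one partition."""
    def __init__(self, nbits, bit_of):
        self.nbits = nbits
        self.bit_of = bit_of      # function mapping a value to a bit position
        self.bits = 0

    def add(self, value):
        self.bits |= 1 << (self.bit_of(value) % self.nbits)

    def may_contain(self, value):
        """False means definitely absent; True means the partition must be searched."""
        return bool((self.bits >> (self.bit_of(value) % self.nbits)) & 1)

def ext_bit(ext):
    return zlib.crc32(ext.encode())   # ordinary hash: one value per bit, with collisions

def size_bit(size):
    return size.bit_length()          # bit i covers sizes in [2**(i-1), 2**i - 1]

ext_sig = SignatureFile(2 * 1024 * 8, ext_bit)   # about 2 KB of bits
size_sig = SignatureFile(64, size_bit)
for ext, size in [("vmdk", 1500), ("txt", 70_000)]:
    ext_sig.add(ext)
    size_sig.add(size)
```

A lookup for a 5,000-byte file falls in a bucket that no indexed file occupies, so the partition can be skipped, while a lookup for 1,100 bytes shares a bucket with the 1,500-byte file and yields a (harmless) false positive.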

Experimental Setup. We evaluated performance us- 
ing our real-world metadata traces described in Table 2. 
These traces have varying sizes, allowing us to exam- 
ine scalability. Our Web and Eng traces also have in- 
cremental snapshot traces of daily metadata changes for 
several days. Since no standard benchmarks exist, we 
constructed synthetic sets of queries, discussed later in 
this section, from our metadata traces to evaluate search 
performance. All experiments were performed on a dual-core
AMD Opteron machine with 8 GB of main memory
running Ubuntu Linux 7.10. All index files were stored 
on a network partition that accessed a high-end NetApp 
file server over NFS. 

We also evaluated the performance of two popular re- 
lational DBMSs, PostgreSQL and MySQL, which serve 
as relative comparison points to DBMS-based solutions 
used in other metadata search systems. The goal of our 
comparison is to provide some context to frame our Spy- 
glass evaluation, not to compare performance to the best 
possible DBMS setup. We compared Spyglass to an 
index-only DBMS setup, which is used in several com- 
mercial metadata search systems, and also tuned various 
options, such as page size, to the best of our ability. This 
setup is effective at pointing out several basic DBMS 
performance problems. DBMS performance can be im- 
proved through the techniques discussed in Section 2; 
however, as stated earlier, they do not completely match 
metadata search cost and performance requirements. 

Our Spyglass prototype indexes the metadata at- 
tributes listed in Table 3. Our index-only DBMSs in- 
clude a base relation with the same metadata attributes 
and a B+-tree index for each. Each B+-tree indexes ta- 
ble row ID. An index-only design reduces space usage 
compared to some more advanced setups, though it has 
slower search performance. Cache sizes were configured
to 128 MB, 512 MB, and 2.5 GB for the Web, Eng, and Home
traces, respectively. These sizes are small relative to the
size of their trace and correspond to about 1 MB for every
125,000 files.

Figure 7: Metadata collection performance. We compare
Spyglass’s snapshot-based crawler (SB) to a straw-man design
(SM). Our crawler has good scalability; performance is a function
of the number of changed files rather than system size.


Metadata Collection Performance. We first evaluated 
our snapshot-based metadata crawler and compared it 
to a straw-man approach. Fast collection performance 
impacts how often updates occur and system resource 
utilization. Our straw-man approach performs a parallelized
walk of the file system using stat() to extract metadata.
Figure 7(a) shows the performance of a baseline crawl of all
file metadata. Our snapshot-based crawler is up to 10× faster
than our straw-man for 100 million files because our approach
simply scans the inode file. As a result, a 100 million file
system is crawled in less than 20 minutes.

Figure 8: Update performance. The time required to build an
initial baseline index shown on a log-scale. Spyglass updates
quickly and scales linearly because updates are written to disk
mostly sequentially.


Figure 7(b) shows the time required to collect incre- 
mental metadata changes. We examine systems with 2%, 
5%, and 10% of their files changed. For example, a 
baseline of 40 million files and 5% change has 2 million 
changed files. For the 100 million file tests, each of our 
crawls finishes in under 45 minutes, while our straw-man 
takes up to 5 hours. Our crawler is able to crawl the inode 
file at about 70,000 files per second. Our crawler effec- 
tively scales because we incur only a fractional overhead 
as more files are crawled; this is due to our crawling only 
changed blocks of the inode file. 


Update Performance. Figure 8 shows the time re- 
quired to build the initial index for each of our metadata 
traces. Spyglass requires about 4 minutes, 20 minutes, 
and 100 minutes for the three traces, respectively. These 
times correspond to a rate of about 65,000 files per sec- 
ond, indicating that update performance scales linearly. 
Linear scaling occurs because updates to each partition 
are written sequentially, with seeks occurring only be- 
tween partitions. Incremental index updates have a sim- 
ilar performance profile because metadata changes are 
written in the same fashion and few disk seeks are added. 
Our reference DBMSs take between 8× and 44× longer
to update because DBMSs require loading their base table
and updating index structures. While loading the table
is fast, updating index structures often requires seeks
back to the base table or extra data copies. As a result,
DBMS updates with our Home trace can take a day or
more; however, approaches such as cache-oblivious B-trees [6]
may be able to reduce this gap.

Figure 9: Space overhead. The index disk space requirements
shown on a log-scale. Spyglass requires just 0.1% of the Web
and Home traces and 10% of the Eng trace to store the index.

Space Overhead. Figure 9 shows the disk space usage 
for all three of our traces. Efficient space usage has two 
primary benefits: less disk space taken from the storage 
system and the ability to cache a higher fraction of the 
index. Spyglass requires less than 0.1% of the total disk 
space for the Web and Home traces. However, it requires 
about 10% for the Eng trace because the total system size 
is low due to very small files. Spyglass requires about 50 
bytes per file across all traces, resulting in space usage 
that scales linearly with system size. Space usage in Spyglass
is 5× to 8× lower than in our reference DBMSs because
they require space to store the base table and index
structures. Figure 9 shows that building index structures
can more than double the total space requirements.

Search Performance. To evaluate Spyglass search per- 
formance, we generated sets of queries derived from real- 
world queries in our user study; there are, unfortunately, 
no standard benchmarks for file system search. These 
query sets are summarized in Table 5. Our first set is 
based on a storage administrator searching for the user 
and application files that are consuming the most space 
(e.g., total size of john’s .vmdk files), an example of
a simple two-attribute search. The second set is an administrator
localizing the same search to only part of
the namespace, which shows how localizing the search 
changes performance. The third set is a storage user 
searching for recently modified files of a particular type 
in a specific sub-tree, demonstrating how searching many 
attributes impacts performance. Each query set consists 
of 100 queries, with attribute values randomly selected 
from our traces. Randomly selecting attribute values 
means that our query sets loosely follow the distribution 
of values in our traces and that a variety of values are 
used.

Set  Search                                                                      Metadata Attributes
1    Which user and application files consume the most space?                    Sum sizes for files using owner and ext.
2    How much space, in this part of the system, do files from query 1 consume?  Use query 1 with an additional directory path.
3    What are the recently modified application files in my home directory?      Retrieve all files using mtime, owner, ext, and path.

Table 5: Query Sets. A summary of the searches used to generate our evaluation query sets.

Figure 10: Query set run times. The total time required to run each set of queries. Each set is labeled 1 through 3 and is clustered
by trace file. Each trace is shown on a separate log-scale axis. Spyglass improves performance by reducing the search space to a
small number of partitions, especially for query sets 2 and 3, which are localized to only a part of the namespace.


Figure 10 shows the total run times for each set of 
queries. In general, query set 1 takes Spyglass the longest 
to complete, while query sets 2 and 3 finish much faster. 
This performance difference is caused by the ability of 
sets 2 and 3 to localize the search to only a part of the 
namespace by including a path with the query. Spyglass 
is able to search only files from this part of the storage 
system by using hierarchical partitioning. As a result, 
the search space for these queries is bound to the size 
of the sub-tree, no matter how large the storage system. 
Because the search space is already small, using many at- 
tributes has little impact on performance for set 3. Query 
set 1, on the other hand, must consider all partitions and 
tests each partition’s signature files to determine which 
to search. While many partitions are eliminated, there 
are more partitions to search than in the other query sets, 
which accounts for the longer run times. 
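
The difference between a namespace-wide query and a localized one can be sketched as follows. This is a toy model under stated assumptions, not Spyglass code: the `search` function and the partition layout are hypothetical, and a plain Python set of values stands in for each signature file, so unlike a real hashed bit array this sketch produces no false positives.

```python
def is_under(path, ancestor):
    """True if path equals ancestor or lies in the subtree below it."""
    ancestor = ancestor.rstrip("/")
    return path == ancestor or path.startswith(ancestor + "/")

def search(partitions, query):
    """Return (roots of partitions to search, number of signatures tested)."""
    path = query.get("path")
    attrs = {k: v for k, v in query.items() if k != "path"}
    selected, tested = [], 0
    for part in partitions:
        root = part["root"]
        # A path predicate excludes partitions outside that sub-tree outright.
        if path and not (is_under(root, path) or is_under(path, root)):
            continue
        tested += 1
        if all(v in part["sig"][k] for k, v in attrs.items()):
            selected.append(root)
    return selected, tested

partitions = [
    {"root": "/home/john", "sig": {"owner": {"john"}, "ext": {"vmdk", "txt"}}},
    {"root": "/home/mary", "sig": {"owner": {"mary"}, "ext": {"vmdk"}}},
    {"root": "/proj",      "sig": {"owner": {"john"}, "ext": {"c", "h"}}},
]
whole = search(partitions, {"owner": "john", "ext": "vmdk"})
local = search(partitions, {"owner": "john", "ext": "vmdk", "path": "/home"})
```

Both queries select the same partition, but the localized one never tests partitions outside `/home`, so its cost depends on the size of that sub-tree rather than on the number of partitions in the whole system.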


Our comparison DBMSs perform closer to Spyglass 
on our smallest trace, Web; however we see the gap 
widen as the system size increases. In fact, Spyglass is 
over four orders of magnitude faster for query sets 2 and 
3 on our Home trace, which is our largest at 300 mil- 
lion files. The large performance gap is due to several 
reasons. First, our DBMSs consider files from all parts 
of the namespace, making the search space much larger. 
Second, skewed attribute value distributions cause our 
DBMSs to process extra data even when there are few 
results. Third, the DBMSs’ base tables ignore metadata
locality, causing extra disk seeks to find files close in the
namespace but far apart in the table. Spyglass, on the
other hand, uses hierarchical partitioning to significantly
reduce the search space, performs only small, sequential
disk accesses, and can exploit locality in the workload to
greatly improve cache utilization.


Using the results from Figure 10, we calculated query 
throughput, shown in Table 6. Query throughput (queries 
per second) provides a normalized view of our results 
and the query loads that can be achieved. Spyglass 
achieves throughput of multiple queries per second in all 
but two cases; in contrast, the reference DBMSs do not 
achieve one query per second in any instance, and, in 
many cases, cannot even sustain one query per five min- 
utes. Figure 11 shows an alternate view of performance; 
a cumulative distribution function (CDF) of query exe- 
cution times on our Home trace, allowing us to see how 
each query performed. In query sets 2 and 3, Spyglass 
finishes all searches in less than a second because local- 
ized searches bound the search space. For query set 1, 
we see that 75% of queries take less than one second, 
indicating that most queries are fast and that a few slow 
queries contribute significantly to the total run times in 
Figure 10. These queries take longer because they must 
read many partitions from disk, either because few were 
previously cached or many partitions are searched. 


Index Locality. We now evaluate how well Spyglass ex- 
ploits spatial locality to improve query performance. We 
generated another set of queries, based on query 1 from
Table 5, with 500 queries with owner and ext values ran- 
domly selected from our Eng trace. We generated similar 
query sets for individual ext and owner attributes. 


Figure 12(a) shows a CDF of the fraction of partitions 
searched. Searching more partitions often increases the 
amount of data that must be read from disk, which de- 
creases performance. We see that 50% of searches using 
just the ext attribute reference fewer than 75% of partitions.
However, 50% of searches using both ext and owner together
reference fewer than 2% of the partitions, since searching
more attributes increases the locality of the search, thereby
reducing the number of partitions that must be searched.
Figure 12(b) shows a CDF of cache hit percentages for the
same set of queries. Higher cache hit percentages mean that
fewer partitions are read from disk. Searching owner and ext
attributes together results in 95% of queries having a cache
hit percentage of 95% or better due to the higher locality
exhibited by multi-attribute searches. The higher locality
causes repeated searches in the sub-trees where these
files reside and allows Spyglass to ignore more non-relevant
partitions.

Table 6: Query throughput. We use the results from Figure 10 to calculate query throughput (queries per second). We find that
Spyglass can achieve query throughput that enables fast metadata search even on large-scale storage systems.

(c) Query set 3.

Figure 11: Query execution times. A CDF of query set execution times for the Eng trace. In Figures 11(b) and 11(c), all queries
are extremely fast because these sets include a path predicate that allows Spyglass to narrow the search to a few partitions.

(b) CDF of partition cache hits.

Figure 12: Index locality. A CDF of the number of partitions
accessed and the number of accesses that were cache hits for
our query set. Searching multiple attributes reduces the number
of partition accesses and increases cache hits.

Figure 13: Versioning overhead. The figure on the left shows
total run time for a set of 450 queries. Each version adds about
10% overhead. On the right, a CDF shows per-query overheads.
Over 50% of queries have an overhead of 5 ms or less.

Versioning Overhead. To measure the search overhead 
added by partition versioning, we generated 450 queries 
based on query 1 from Table 5 with values randomly selected
from our Web trace. We included three full days
of incremental metadata changes, and used them to per- 
form three incremental index updates. Figure 13 shows 
the time required to run our query set with an increasing 
number of versions; each version adds about a 10% over- 
head to the total run time. However, the overhead added 
to most queries is quite small. Figure 13 also shows, via
a CDF of the query overheads incurred for each version,
that more than 50% of the queries have less than a 5 ms 
overhead. Thus, it is a few much slower queries that con- 
tribute to most of the 10% overhead. This behavior oc- 
curs because overhead is typically incurred when incre- 
mental indexes are read from disk, which doesn’t occur 
once a partition is cached. Since reading extra versions 
does not typically incur extra disk seeks, the overhead 
for the slower queries is mostly due to reading partitions 
with much larger incremental indexes from disk. 


5 Related Work 


Spyglass seeks to improve how file systems manage 
growing volumes of data, which has been an important 
challenge and an active area of research for over two 
decades. A significant amount of work has looked at 
how file systems can improve file naming and organi- 
zation by leveraging file attributes. The Semantic File 
System [16] utilized file (attribute,value) pairs to dy- 
namically construct a namespace based on queries rather 
than use a standard hierarchical namespace. Virtual di- 
rectories allowed queries to be integrated directly into the 
namespace as a directory containing search results. The 
Hierarchy and Content (HAC) [18] file system looked
at how Semantic File System concepts could be applied
to a hierarchical namespace, providing users with a new 
naming mechanism without requiring them to forgo tra- 
ditional hierarchies. These and similar systems [32, 36] 
focus on how users name and view files, though they do 
not focus on how files are actually indexed and searched, 
thereby potentially limiting their performance and scal- 
ability. While Spyglass does not provide higher level 
naming semantics, it is the first to address the challenge 
of scalable file metadata indexing and search, allowing it 
to potentially be used as the underlying indexing method 
for such file systems. 

Spyglass focuses on how to exploit file metadata prop- 
erties to improve search performance and scalability, 
though it is not the first to look at how new indexing 
structures improve file retrieval. Inversion [33] used a 
general-purpose DBMS as the core file system structure, 
rather than traditional file system inode and data layouts. 
Inversion used several PostgreSQL tables to store both 
file metadata and data, allowing the file system to benefit 
from database transaction and recovery support and al- 
lowing metadata and data to be queried. Like Spyglass, 
Inversion provides ad hoc metadata query functionality, 
though it focuses on allowing file systems to leverage 
database functionality rather than on query performance. 

However, a number of new index designs have been 
proposed to improve various aspects of file system 
search. GLIMPSE [29] reduced disk space requirements, 
compared to a normal full-text inverted index, by maintaining
only a partial inverted index that does not store
the location of every term occurrence. Like Spyglass, 
GLIMPSE partitioned the search space, using fixed size 
blocks of the file space, which were then referenced by 
the partial inverted index. A tool similar to grep was 
used to find exact term locations within each fixed size
block. Similarly, Diamond [20] eliminated disk space re- 
quirements by using a mechanism to improve the speed 
of brute force searches instead of maintaining an index. 
A technique called Early Discard allowed files that are 
irrelevant to the search to be rejected as early as possi- 
ble, helping to reduce the search space. Early Discard 
used application-specific “searchlets” to determine when 
a file is irrelevant to a given query. Geometric partition- 
ing [24] aimed to improve inverted index update perfor- 
mance by breaking up the inverted index’s inverted lists 
by update time. The most recently updated inverted lists 
were kept small and sequential, allowing future updates 
to be applied quickly. A merging algorithm created new 
partitions as the lists grow over time. Query-based par- 
titioning [31] used a similar approach, though it parti- 
tioned the inverted index based on file search frequency, 
allowing index data for infrequently searched files to be 
offloaded to second-tier storage to improve cost. 


6 Conclusions and Future Work 


As storage systems have become larger, finding and man- 
aging files has become increasingly difficult. To address 
this problem we presented Spyglass, a metadata search 
system that improves file management by allowing com- 
plex, ad hoc queries over file metadata. Spyglass in- 
troduces several novel indexing techniques that improve 
metadata crawling, search, and update performance by 
exploiting metadata properties. Our evaluation shows
that Spyglass search performance is one to four orders of
magnitude faster than existing designs.

We plan on improving Spyglass in the future in a num- 
ber of ways. First, we plan on addressing file security 
by leveraging hierarchical partitioning to help eliminate 
partitions that the user does not have access to from the 
search space. Second, we are exploring new interface 
and query language designs that allow users to ask com- 
plex queries (e.g., “back-in-time” queries) while remain- 
ing easy to use. Third, we propose fully distributing Spy- 
glass across a cluster by allowing partitions to be repli- 
cated and migrated across machines. Fourth, we will ex- 
plore how partitioning can be improved by using other 
metadata attributes to partition the index. Finally, we are 
looking at how Spyglass can be used as the main meta- 
data store for a storage system, eliminating many of the 
space and performance overheads incurred when used in 
addition to the storage system’s metadata store.


Perspective: Semantic data management for the home


Abstract


Perspective is a storage system designed for the home, 
with the decentralization and flexibility sought by home 
users and a new semantic filesystem construct, the view, 
to simplify management. A view is a semantic descrip- 
tion of a set of files, specified as a query on file attributes, 
and the ID of the device on which they are stored. By ex- 
amining and modifying the views associated with a de- 
vice, a user can identify and control the files stored on 
it. This approach allows users to reason about what is 
stored where in the same way (semantic naming) as they 
navigate their digital content. Thus, in serving as their 
own administrators, users do not have to deal with a sec- 
ond data organization scheme (hierarchical naming) to 
perform replica management tasks, such as specifying 
redundancy to increase reliability and data partitioning 
to address device capacity exhaustion. Experiences with 
Perspective deployments and user studies confirm the ef- 
ficacy of view-based data management. 


1 Introduction 


Distributed storage is coming home. An increasing num- 
ber of home and personal electronic devices create, use, 
and display digitized forms of music, images, videos, as 
well as more conventional files (e.g., financial records 
and contact lists). In-home networks enable these de- 
vices to communicate, and a variety of device-specific 
and datatype-specific tools are emerging. The transi- 
tion to digital homes gives exciting new capabilities to 
users, but it also makes them responsible for administra- 
tion tasks usually handled by dedicated professionals in 
other settings. It is unclear that traditional data manage- 
ment practices will work for “normal people” reluctant 
to put time into administration. 


This paper presents the Perspective distributed filesys- 
tem, part of an expedition into this new domain for 
distributed storage. As with previous expeditions into 
new computing paradigms, such as distributed operat- 
ing systems (e.g., [23, 27]) and ubiquitous computing 
(e.g., [41]), we are building and utilizing a system rep- 
resenting the vision in order to gain experience. In this 
case, however, the researchers are not representative of 
the user population. Most will be non-technical people 
who just want to use the system, but must (begrudgingly)
deal with administration tasks or live with the consequences.
Thus, organized user studies will be required
as complements to systems experimentation. 


Perspective’s design is motivated by a contextual anal- 
ysis and early deployment experiences [31]. Our inter- 
actions with users have made clear the need for decen- 
tralization, selective replication, and support for device 
mobility and dynamic membership. An intriguing les- 
son is that home users rarely organize and access their 
data via traditional hierarchical naming—usually, they 
do so based on data attributes. Computing researchers 
have long talked about attribute-based data navigation 
(e.g., semantic filesystems [12, 36]), while continuing 
to use directory hierarchies. However, users of home 
and personal storage live it. Popular interfaces (e.g., 
iTunes, iPhoto, and even drop-down lists of recently- 
opened Word documents) allow users to navigate file col- 
lections via attributes like publisher-provided metadata, 
extracted keywords, and date/time. Usually, files are still 
stored in underlying hierarchical file systems, but users 
often are insulated from naming at that level and are 
oblivious to where in the namespace given files end up. 


Users have readily adopted these higher-level navigation 
interfaces, leading to a proliferation of semantic data lo- 
cation tools [42, 3, 13, 37, 19]. In contrast, the abstrac- 
tions provided by filesystems for managing files have re- 
mained tightly tied to hierarchical namespaces. For ex- 
ample, most tools require that specific subtrees be identi- 
fied, by name or by “volumes” containing them, in order 
to perform replica management tasks, such as partition- 
ing data across computers for capacity management or 
specifying that multiple copies of certain data be kept for 
reliability. Since home users double as their own system 
administrators, this disconnect between interface styles 
(semantic for data access activities and hierarchical for 
management tasks) naturally creates difficulties. 


The Perspective distributed filesystem allows a collec- 
tion of devices to share storage without requiring a cen- 
tral server. Each device holds a subset of the data and 
can access data stored on any other (currently connected) 
device. However, Perspective does not restrict the sub- 
set stored on each device to traditional volumes or sub- 
trees. To correct the disconnect between semantic data 
access and hierarchical replica management, Perspective 
replaces the traditional volume abstraction with a new
primitive we call a view. A view is a compact description
of a set of files, expressed much like a search query,
and a device on which that data should be stored. For 
example, one view might be “all files with type=music 
and artist=Beatles stored on Liz’s iPod” and another “all 
files with owner=Liz stored on Liz’s laptop”. Each de- 
vice participating in Perspective maintains and publishes 
one or more views to describe the files that it stores. Per- 
spective ensures that any file that matches a view will 
eventually be stored on the device named in the view. 
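
The two example views can be modeled in a few lines. This is a hedged sketch only: the `View` class, the `devices_for` helper, and the restriction to conjunctive equality predicates are assumptions made for illustration, and they do not reflect Perspective's actual query language or data structures.

```python
from dataclasses import dataclass

@dataclass
class View:
    device: str    # device on which matching files should be stored
    query: dict    # attribute name -> required value (all must match)

    def matches(self, attrs):
        return all(attrs.get(k) == v for k, v in self.query.items())

def devices_for(file_attrs, views):
    """Devices that should eventually hold a file with these attributes."""
    return sorted({v.device for v in views if v.matches(file_attrs)})

views = [
    View("Liz's iPod", {"type": "music", "artist": "Beatles"}),
    View("Liz's laptop", {"owner": "Liz"}),
]
song = {"type": "music", "artist": "Beatles", "owner": "Liz"}
memo = {"type": "document", "owner": "Liz"}
song_devices = devices_for(song, views)
memo_devices = devices_for(memo, views)
```

The song matches both views and so belongs on both devices, while the memo matches only the laptop's view. Listing the views is thus enough to see what is stored where.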


Since views describe sets of files using the same 
attribute-based style as users’ other tools, view-based 
replica management is easier than hierarchical
file management. A user can see what is stored
where, in a human-readable fashion, by examining the 
set of views in the system. She can control replication 
and data placement by changing the views of one or more 
devices. Views allow sets of files to overlap and to be de- 
scribed independently of namespace structure, removing 
the need for users to worry about application-internal file 
naming decisions or unfortunate volume boundaries. Se- 
mantic management can also be useful for local manage- 
ment tasks, such as setting file attributes and security, in 
addition to replica management. In addition to anecdo- 
tal experiences, an extensive lab study of 30 users each 
performing 10 different management tasks confirms that 
view-based management is easier for users than volume- 
based management. 


This paper describes view-based management and our 
Perspective prototype, which combines existing tech- 
nologies with several new algorithms to implement view- 
based distributed storage. View-based data placement 
and view freshness allow Perspective to manage and 
expose data mobility with views. Distributed update 
rules allow Perspective to ensure and expose permanence 
with views (which can be thought of as semantically- 
defined volumes). Perspective introduces overlap trees 
as a mechanism for reasoning about how many replicas 
exist of a particular dataset, and where these files are 
stored, even when no view exactly matches the attributes 
of the dataset. 


Our Perspective prototype is a user-level filesystem 
which runs on Linux and OS X. In our deployments, Per- 
spective provides normal file storage as well as being the 
backing store for iTunes and MythTV in one household 
and in our research environment lounge. Experiments 
with the Perspective prototype confirm that it can provide 
consistent, decentralized storage with reasonable perfor- 
mance. Even with its application-level implementation 
(connected to the OS via FUSE [10]), Perspective per- 
formance is within 3% of native filesystem performance 
for activities of interest. 
2 Storage for the home


Storage has become a component of many consumer 
electronics. Currently, most stored content is captive 
to individual devices (e.g., DVRs, digital cameras, dig- 
ital picture frames, and so on), with human-intensive and 
proprietary interfaces (if any) for copying it to other de- 
vices. But, we expect a rapid transition toward exploiting 
wireless home networking to allow increased sharing of 
content across devices. Thus, we are exploring how to 
architect a distributed storage system for the home. 


The home is different from an enterprise. Most notably, 
there are no sysadmins—household members generally 
deal with administration (or don’t) themselves. The users 
also interact with their home storage differently, since 
most of it is for convenience and enjoyment rather than 
employment. However, much of the data stored in home 
systems, such as family photos, is both important and ir- 
replaceable, so home storage systems must provide high 
levels of reliability in spite of lax management practices. 
Not surprisingly, we believe that home storage’s unique 
requirements would be best served by a design different
from that of enterprise storage. This section outlines insights
gained from studying use of storage in real homes and 
design features suggested by them. 


2.1 What users want 


A contextual analysis is an HCI research technique that 
provides a wealth of in-situ data, perspectives, and real- 
world anecdotes of the use of technology. It consists of 
interviews conducted in the context of the environment 
under study. To better understand home storage, we ex- 
tensively interviewed all members of eight households 
(24 people total), in their homes and with all of their stor- 
age devices present. We have also gathered experiences 
from early deployments in real homes. This section lists 
some guiding insights (with more detailed information 
available in technical reports [31]). 


Decentralized and Dynamic: The users in our study 
employed a wide variety of computers and devices. 
While it was not uncommon for them to have a set of pri- 
mary devices at any given point in time, the set changed 
rapidly, the boundaries between the devices were porous, 
and different data was “homed” on different devices with 
no central server. One household had set up a home 
server, at one point, but did not re-establish it when they 
upgraded the machine due to setup complexity. 


Money matters: While the cost of storage continues 
to decrease, our interviews showed that cost remains a 
critical concern for home users. (Note that our studies
were conducted well before the Fall 2008 economic
crisis.) While cost also matters to enterprises, home storage
rarely has a clear “return on investment,” and the
cost is instead balanced against other needs (e.g., new 
shoes for the kids) or other forms of enjoyment. Thus, 
users replicate selectively, and many adopted cumber- 
some data management strategies to save money. 


Semantic naming: Most users navigated their data via 
attribute-based naming schemes provided by their ap- 
plications, such as iPhoto, iTunes, and the like. Of 
course, these applications stored the content into files 
in the underlying hierarchical file system, but users 
rarely knew where. This disconnect created problems 
when they needed to make manual copies or configure 
backup/synchronization tools. 


Need to feel in control: Many approaches to manage- 
ability in the home tout automation as the answer. While 
automation is needed, the users expressed a need to 
understand and sometimes control the decisions being 
made. For example, only 2 of the 14 users who backed up 
data used backup tools. The most commonly cited rea- 
son was that they did not understand what the tool was 
doing and, thus, found it more difficult to use the tool 
than to do the task by hand. 


Infrequent, explicit data placement: Only 2 of 24 users 
had devices on which they regularly placed data in antic- 
ipation of needs in the near future. Instead, most users 
decided on a type of data that belonged on a device (e.g., 
“all my music” or “files for this semester”) and rarely revisited
these decisions, usually only when prompted by
environmental changes. Many did regularly copy new 
files matching each device’s data criteria onto it. 


2.2 Designing home storage 


From the insights above, we extract guidance that has 
informed our design of Perspective. 


Peer-to-peer architecture: While centralization can be 
appealing from a system simplicity standpoint, and has 
been a key feature in many distributed filesystems, it 
seems to be a non-starter with home users. Not only do 
many users struggle with the concept of managing a cen- 
tral server, many will be unwilling to invest the money 
necessary to build a server with sufficient capacity and 
reliability. We believe that a decentralized, peer-to-peer 
architecture more cleanly matches the realities we en- 
countered in our contextual analysis. 


Single class of replicas: Many previous systems have 
differentiated between two classes: permanent replicas 
stored on server devices and temporary replicas stored
on client devices (e.g., to provide mobility) [32, 25]. 
While this distinction can simplify system design, it in- 
troduces extra complexity for users, and prevents users 
from utilizing the capacity on client devices for reliability,
which can be important for cost-conscious home consumers.
Having only a single replica class removes the
client-server distinction from the user’s perception and 
allows all peers to contribute capacity to reliability. 


Semantic naming for management: Using the same 
type of naming for both data access and management 
should be much easier for users who serve as their own 
administrators. Since home storage users have chosen 
semantic interfaces for data navigation, replica manage- 
ment tools should be adapted accordingly—users should 
be able to specify replica management policies applied 
to sets of files identified by semantic naming. 


In theory, applications could limit the mismatch by align- 
ing the underlying hierarchy to the application represen- 
tation. But, this alternative seems untenable, in practice. 
It would limit the number of attributes that could be han- 
dled, lock the data into a representation for a particular 
application, and force the user to sort data in the way 
the application desires. Worse, for data shared across ap- 
plications, vendors would have to agree on a common 
underlying namespace organization. 


Rule-based data placement: Users want to be able to 
specify file types (e.g., “Jerry’s music files”) that should
be stored on particular devices. The system should al- 
low such rules to be expressed by users and enforced 
by the system as new files are created. In addition to 
helping users to get the right data onto the right devices, 
such support will help users to express specific replica- 
tion rules at the right granularity, to balance their relia- 
bility and cost goals. 


Transparent automation: Automation can simplify 
storage management, but many home users (like enter- 
prise sysadmins) insist on understanding and being able 
to affect the decisions made. By having automation tools 
use the same flexible semantic naming schemes as users 
do normally, it should be possible to create interfaces 
that express human-readable policy descriptions and al- 
low users to understand automated decisions. 


3 Perspective architecture 


Perspective is a distributed filesystem designed for home 
users. It is decentralized, enables any device to store 
and access any data, and allows decisions about what is 
stored where to be expressed or viewed semantically. 


Perspective provides flexible and comprehensible file or- 
ganization through the use of views. A view is a concise 
description of the data stored on a given device. Each 
view describes a particular set of data, defined by a se- 
mantic query, and a device on which the data is stored. A 
view-based replica management system guarantees that 
any object that matches the view query will eventually
be stored on the device named in the view. We will de- 
scribe our query language in detail in Section 4.1. 


Figure 1 illustrates a combination of management tools
and storage infrastructure that we envision, with views 
serving as the connection between the two layers. Users 
can set policies through management tools, such as those 
described in Section 5, from any device in the system 
at any time. Tools implement these changes by manipu- 
lating views, and the underlying infrastructure (Perspec- 
tive) in turn enforces those policies by keeping files in 
sync among the devices according to the views. Views 
provide a clear division point between tools that allow 
users to manage data replicas and the underlying filesys- 
tem that implements the policies. 


View-based management enables the design points out- 
lined in Section 2.2. Views provide a primitive allowing 
users to specify meaningful rule-based placement poli- 
cies. Because views are semantic, they unify the naming 
used for data access and data management. Views are 
also defined in a human-understandable fashion, provid- 
ing a basis for transparent automation. Perspective pro- 
vides data reliability using views without restricting their 
flexibility, allowing it to use a single replica class. 


3.1 Placing file replicas 


In Perspective, the views control the distribution of data 
among the devices in the system. When a file is created 
or updated, Perspective checks the attributes of the file 
against the current list of views in the system and sends 
an update message to each device with a view that con- 
tains that file. Each device can then independently pull a 
copy of the update. 


When a device, A, receives an update message from an- 
other device, B, it checks that the updated file does, in- 
deed, match one or more views that A has registered. If 
the file does match, then A applies the update from B. 
If there is no match, which can occur if the attributes of 
a file are updated such that it is no longer covered by a 
view, then A ensures that there is no replica of the file 
stored locally. 
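
The placement protocol above can be sketched in a few lines of Python.
This is an illustrative sketch only, not the prototype's C++ code: the
names View, Device, on_update, and devices_to_notify are hypothetical,
queries are modeled as plain predicates over attribute dictionaries, and
the update rules of Section 4.3 that guard against data loss are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Attrs = Dict[str, str]  # semantic metadata of one file


@dataclass
class View:
    device: str                     # device on which matching files are stored
    query: Callable[[Attrs], bool]  # stand-in for a semantic query


@dataclass
class Device:
    name: str
    views: List[View] = field(default_factory=list)
    replicas: Dict[str, Attrs] = field(default_factory=dict)

    def on_update(self, file_id: str, attrs: Attrs) -> bool:
        """Handle an update message; return True if a replica is kept."""
        if any(v.query(attrs) for v in self.views):
            self.replicas[file_id] = dict(attrs)  # apply the update
            return True
        # No registered view covers the file any more: drop the local replica.
        self.replicas.pop(file_id, None)
        return False


def devices_to_notify(attrs: Attrs, all_views: List[View]) -> List[str]:
    """Devices with at least one view that contains the created/updated file."""
    return sorted({v.device for v in all_views if v.query(attrs)})
```

For instance, a file tagged type=music, artist=Beatles, and owner=Liz would
be announced to both devices in the two example views from Section 1, and
retagging its artist would cause the music player to drop its replica.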


This simple protocol automatically places new files, and 
also keeps current files up to date according to the cur- 
rent views in the system. Simple rules described in Sec- 
tion 4.3 assure that files are never dropped due to view 
changes. 


Each device is represented by a file in the filesystem that
describes the device and its characteristics. Views themselves
are also represented by files. Each device registers
a view for all device and view files to assure they are
replicated on all participating devices. This allows applications
to manage views through the standard filesystem
interfaces, even if not all devices are currently present.


Figure 1: View-based architecture. Views are the interface between
management tools and the underlying heterogeneous, disconnected infrastructure.
By manipulating the views, tools can specify data policies
that are then enforced by the filesystem.


3.2 View-based data management 


This subsection presents three scenarios to illustrate 
view-based management. 


Traveling: Harry is visiting Sally at her house and would 
like to play a new U2 album for her while he is at her 
house. Before leaving, he checks the views defined on 
his wireless music player and notices the songs are not 
stored on the device, though he can play them from his 
laptop where they are currently stored. He asks the mu- 
sic player to pull a copy of all U2 songs, which the player 
does by creating a new view for this data. When the syn- 
chronization is complete, the filesystem marks the view 
as complete, and the music player informs Harry. 


He takes the music player over to Sally’s house. Be- 
cause the views on his music player are defined only for 
his household and the views on Sally’s devices for her 
household, no files are synchronized. But, queries for 
“all music” initiated from Sally’s digital stereo can see 
the music files on Harry’s music player, so they can lis- 
ten to the new U2 album off of Harry’s music player on 
the nice stereo speakers, while he is visiting. 


Crash: Mike’s young nephew Oliver accidentally pushes 
the family desktop off of the desk onto the floor and 
breaks it. Mike and his wife Carol have each configured 
the system to store their files both on their respective lap- 
top and on the desktop, so their data is safe. When they 
set up the replacement computer, a setup tool pulls the 
device objects and views from other household devices. 
The setup tool gives them the option to replace an old de- 
vice with this computer, and they choose the old desktop 
from the list of devices. The tool then creates views on 
the device that match the views on the old desktop and 
deletes the device object for the old computer. The data 
from Mike and Carol’s laptops is transferred to the new 
desktop in the background over the weekend.


Short on space: Marge is working on a project for work 
on her laptop in the house. While she is working, a ca- 
pacity automation tool on her laptop alerts her that the 
laptop is short on space. It recommends that files cre- 
ated over two years ago be moved to the family desk- 
top, which has spare space. Marge, who is busy with her 
project, decides to allow the capacity tool to make the 
change. She later decides to keep her older files on the 
external hard drive instead, and makes the change using 
a view-editing interface on the desktop. 


4 Perspective design 


This section details three aspects of Perspective: seman- 
tic search and naming, consistent partial replication of 
sets of files, and reliability maintenance and reasoning. 


The Perspective prototype is implemented in C++ and 
runs at user-level using FUSE [10] to connect with the 
system. It currently runs on both Linux and Macintosh 
OS X. Perspective stores file data in files in a reposi- 
tory on the machine’s local filesystem and metadata in 
a SQLite database with an XML wrapper. Our prototype 
implements all of the features described in this paper ex- 
cept garbage collection and some advanced update log 
features. 


The prototype system has been supporting one researcher’s
household’s DVR, which is under heavy use; it is the ex- 
clusive television for him and his four roommates, and is 
also frequently used by sixteen other friends in the same 
complex. It has also stored one researcher’s personal data 
for about a year. It has also been the backing store for 
the DVR in the lounge for our research group for several 
months. We are preparing the system for deployment in 
several non-technical households for a wider, long-term 
user study over several months. 


4.1 Search and naming 


All naming in Perspective uses semantic metadata. 
Therefore, search is a very common operation both for 
users and for many system operations. Metadata queries 
can be made from any device and Perspective will return 
references to all matching files on devices currently ac- 
cessible (e.g., on the local subnet), which we will call 
the current device ensemble [33]. Views allow Perspec- 
tive to route queries to devices containing all needed files 
and, when other devices suffice, avoid sending queries 
to power-limited devices. While specialized applications 
may use the Perspective API directly, we expect most ap- 
plications to access files through the standard VFS layer, 
just as they access other filesystems. Perspective provides
this access using front ends that support a variety
of user-facing naming schemes. These names are then 
converted to Perspective searches, which are then passed 
on to the filesystem. Our current prototype system imple- 
ments four front ends that each support a different orga- 
nization: directory hierarchies, faceted metadata, simple 
search, and hierarchies synthesized from the values of 
specific tags. 


Query language and operations: We use a query lan- 
guage based on a subset of the XPath language used for 
querying XML. Our language includes logic for com- 
paring attributes to literal values with equality, standard 
mathematical operators, string search, and an operator 
to determine if a document contains a given attribute. 
Clauses can be combined with the logical operators and, 
or, and not. Each attribute is allowed to have a sin- 
gle value, but multi-value attributes can be expressed in 
terms of single value attributes, if necessary. We require 
all comparisons to be between attributes and constant 
values. 


In addition to standard queries, we support two opera- 
tions needed for efficient naming and reliability analy- 
sis. The first is the enumerate values query, which re- 
turns all values of an attribute found in files matching 
a given query. The second is the enumerate attributes 
query, which returns all the unique attributes found in 
files matching a given query. These operations must be 
efficient; fortunately we can support them at the database 
level using indices, which negate the need for full enu- 
meration of the files matching the query. 
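
As a rough illustration of how the two enumeration operations can be
pushed down to the database, the sketch below uses Python's sqlite3
module. The schema (a single attrs table with one row per attribute of
each file), the index, and the function names are assumptions made for
this example and are not the prototype's actual layout. For brevity, the
"given query" is restricted to a single attribute=value clause.

```python
import sqlite3


def open_store():
    """Create an in-memory metadata store (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE attrs (file_id INTEGER, name TEXT, value TEXT)")
    # An index the database may use instead of scanning every row.
    db.execute("CREATE INDEX attrs_by_name_value ON attrs (name, value, file_id)")
    return db


def add_file(db, file_id, attrs):
    db.executemany("INSERT INTO attrs VALUES (?, ?, ?)",
                   [(file_id, k, v) for k, v in attrs.items()])


def enumerate_values(db, attribute, where_name, where_value):
    """All values of `attribute` among files with where_name = where_value."""
    rows = db.execute(
        "SELECT DISTINCT value FROM attrs WHERE name = ? AND file_id IN "
        "(SELECT file_id FROM attrs WHERE name = ? AND value = ?) "
        "ORDER BY value",
        (attribute, where_name, where_value))
    return [r[0] for r in rows]


def enumerate_attributes(db, where_name, where_value):
    """All attribute names found in files with where_name = where_value."""
    rows = db.execute(
        "SELECT DISTINCT name FROM attrs WHERE file_id IN "
        "(SELECT file_id FROM attrs WHERE name = ? AND value = ?) "
        "ORDER BY name",
        (where_name, where_value))
    return [r[0] for r in rows]
```

Both functions return only distinct attribute names or values, so the
caller never has to materialize the full list of matching files.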


This language is expressive enough to capture many 
common data organization schemes (e.g., directories, 
unions [27], faceted metadata [43], and keyword search) 
but is still simple enough to allow Perspective to effi- 
ciently reason about the overlap of queries. Perspective 
can support any of the replica management functions de- 
scribed in this paper for any naming scheme that can be 
converted into this language. 


Overlap evaluation is commonly used to compare two 
queries. The overlap evaluation operation returns one 
of three values when applied to two queries: one query 
subsumes the other, the two queries have no-overlap, 
or the relationship between them is unknown. Note 
that the comparison operator is used for efficiency but 
not correctness, allowing for a trade-off between language
complexity and efficiency. For example, Perspective
can determine that the query “all files where date
< January, 2008” is subsumed by the query “all files
where date < June, 2008”, and that the query “all files
where owner=Brandon” does not overlap with the query
“all files where owner=Greg”. However, it cannot determine
the relationship between the queries “all files where
type=Music” and “all files where album=The Joshua Tree”.
Perspective will correctly handle operations on the latter
two queries, but at some cost in efficiency.
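
The three-valued comparison can be illustrated with a small sketch.
It assumes queries consisting of a single clause with either an equality
or a less-than operator, and it relies on the rule above that each
attribute holds a single value. The Clause type, the compare function,
and the result labels are hypothetical names chosen for this example.

```python
from datetime import date
from typing import NamedTuple


class Clause(NamedTuple):
    attr: str
    op: str        # "=" or "<"
    value: object  # constant to compare the attribute against


def compare(a: Clause, b: Clause) -> str:
    """Return 'a-subsumes-b', 'b-subsumes-a', 'no-overlap', or 'unknown'."""
    if a.attr != b.attr:
        return "unknown"  # different attributes: cannot decide cheaply
    if a.op == "=" and b.op == "=":
        # Single-valued attributes: different constants never match one file.
        return "a-subsumes-b" if a.value == b.value else "no-overlap"
    if a.op == "<" and b.op == "<":
        return "a-subsumes-b" if a.value >= b.value else "b-subsumes-a"
    if a.op == "<" and b.op == "=":
        return "a-subsumes-b" if b.value < a.value else "no-overlap"
    if a.op == "=" and b.op == "<":
        return "b-subsumes-a" if a.value < b.value else "no-overlap"
    return "unknown"  # operator not modeled in this sketch


# The three cases from the text.
before_june = Clause("date", "<", date(2008, 6, 1))
before_january = Clause("date", "<", date(2008, 1, 1))
print(compare(before_june, before_january))
print(compare(Clause("owner", "=", "Brandon"), Clause("owner", "=", "Greg")))
print(compare(Clause("type", "=", "Music"),
              Clause("album", "=", "The Joshua Tree")))
```

Returning "unknown" is always safe in this scheme: the caller simply
falls back to a more expensive check, which is why the operator affects
efficiency but not correctness.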


4.2 Partial replication 


Perspective supports partial replication among the de- 
vices in a home. Devices in Perspective can each store 
disjoint sets of files — there is no requirement that any 
master device store all files or that any device mirror an- 
other in order to maintain reliability. Previous systems 
have supported either partial replication [16, 32] or topology
independence [40], but not both. PRACTI [7] provided
a combination of the two properties tied to directories,
but probably could be extended to work in the semantic
case. Recently, Cimbiosys [28] has also provided
partial replication with effective topology independence, 
although it requires all files to be stored on some mas- 
ter device. We present Perspective’s algorithms to show 
that it is possible to build a simple, efficient consistency 
protocol for a view-based system, but a full comparison 
with previous techniques is beyond the scope of this pa- 
per. The related work section presents the differences 
and similarities with previous work. 


Synchronization: Devices that are not currently acces- 
sible at the time of an update will receive that update at 
synchronization time, when the two devices exchange in- 
formation about updates that they may have missed. De- 
vice and view files are always synchronized before other 
files, to make sure the device does not miss files matching 
new views. Perspective employs a modified update log 
to limit the exchanges to only the needed information, 
much like the approach used in Bayou [40]. However, 
the flexibility of views makes this task more challenging. 


For each update, the log contains the metadata for the file 
both before and after the update. Upon receiving a sync 
request, a device returns all updates that match the views 
for the calling device either before or after the update. As 
in Bayou, the update log is only an optimization; we can 
always fall back on full file-by-file synchronization. 
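
To make the role of the before and after metadata concrete, the sketch
below filters a log for a calling device. It is a simplified model under
stated assumptions: LogEntry and updates_for are hypothetical names, and
the caller's views are again modeled as predicates over attribute
dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Attrs = Dict[str, str]


@dataclass
class LogEntry:
    file_id: str
    before: Optional[Attrs]  # metadata before the update (None for a new file)
    after: Attrs             # metadata after the update


def updates_for(log: List[LogEntry],
                caller_views: List[Callable[[Attrs], bool]]) -> List[LogEntry]:
    """Entries whose file matched a caller view before or after the update."""
    def covered(attrs: Optional[Attrs]) -> bool:
        return attrs is not None and any(q(attrs) for q in caller_views)

    return [e for e in log if covered(e.before) or covered(e.after)]
```

Checking the "before" metadata matters when an update moves a file out of
a view: the caller must still hear about that update so that it can drop
its now-unmatched replica.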


Conventional full synchronization can be problematic for 
heterogeneous devices with partial replication, especially 
for resource- and battery-limited devices. For example, if 
a cell phone syncs with a desktop computer, it is not fea- 
sible for the cell phone to process all of the files on the 
desktop, even occasionally. To address this problem, Per- 
spective includes a second synchronization option. Con- 
tinuing the example, the cell phone first asks the desktop 
how many updates it would return. If this number is too 
large, the cell phone can pass the metadata of all the files 
it owns to the desktop, along with the view query, and 
ask the desktop for updates for any files that match the 
view or are contained in the set of files currently on the 
cell phone. At each synchronization, the calling device
can choose either of these two methods, reducing synchronization
costs to $O(N_{smaller})$, where $N_{smaller}$ is the
number of files stored on the smaller device.


Full synchronizations will only return the most recent 
version of a file, which may cause gaps in the update 
logs. If the update log has a gap in the updates for a 
file, recognizable by a gap in the versions of the before 
and after metadata, the calling device must pass this up- 
date back to other devices on synchronization even if 
the metadata does not match the caller’s views, to avoid 
missing updates to files which used to match a view, but 
now do not. 


Consistency: As with many file systems that support 
some form of eventual consistency, Perspective uses ver- 
sion vectors and epidemic propagation to ensure that all 
file replicas eventually converge to the same version. 
Version vectors in Perspective are similar to those used 
in many systems; the vector contains a version for each 
replica that has been modified. Because Perspective does 
not have the concept of volumes, it does not support 
volume-level consistency like Bayou. Instead, it supports 
file-level consistency, like FICUS [16]. 
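
For readers unfamiliar with version vectors, the following generic
sketch shows how two replicas of one file can be compared. It is the
standard technique rather than Perspective's specific code; the
dictionary representation (replica identifier to counter, with missing
entries treated as zero) and the function name are assumptions.

```python
from typing import Dict


def compare_vectors(a: Dict[str, int], b: Dict[str, int]) -> str:
    """Return 'equal', 'a-newer', 'b-newer', or 'conflict'."""
    replicas = set(a) | set(b)
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in replicas)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        return "conflict"  # concurrent updates on different replicas
    if a_ahead:
        return "a-newer"
    if b_ahead:
        return "b-newer"
    return "equal"
```

A "conflict" result corresponds to the situation handled by the conflict
resolution mechanism described below.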


To keep all file replicas consistent, we need to assure that 
updates will eventually reach all replicas. If all devices 
in the household sync with one another occasionally, this 
property will be assured. While this is a reasonable as- 
sumption in many homes, we do not require full pair- 
wise device synchronization. Like many systems built on 
epidemic propagation, a variety of configurations satisfy 
this property. For example, even if some remote device 
(e.g., a work computer) never returns home, the property 
will still hold as long as some other device that syncs 
with the remote device and does return home (e.g., a lap- 
top) contains all the data stored on the remote device. 
System tools might even create views on such devices to 
facilitate such data transfer, similar to the routing done in 
Footloose [24]. Alternatively, a sync tree, such as that used in
Cimbiosys [28], could be layered on top of Perspective to
provide connectedness guarantees.


By tracking the timestamps of the latest complete sync 
operation for each device in the household, devices pro- 
vide a freshness timestamp for each view. Perspective 
can guarantee that all file versions created before the 
freshness timestamp for a view are stored on that view’s 
device. It can also recommend sync operations needed to 
advance the freshness timestamp for any view. 


Conflicts: Any system that supports disconnected oper- 
ation must deal with conflicts, where two devices mod- 
ify the same file without knowledge of the other de- 
vice’s modification. We resolve conflicts first with a pre- 
resolver, which uses the metadata of the two versions to 
deterministically choose a winning and losing version.
Our pre-resolver can be run on any device without any
global agreement. It uses the file’s modification time and
then the sorted version vector in the case of a tie. But, instead
of eliminating the losing version, the pre-resolver
creates a new file and places the losing version in it.
It then tags the new file with all metadata from the
losing version, as well as tags marking it as a conflict file
and tying it to the winning version. Later, a full resolver,
which may ask for user input or use more sophisticated 
logic, can search for conflict objects, remove duplicates, 
and adjust the resolution as desired. 
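
A minimal sketch of such a pre-resolver follows. Several details are
assumptions made for illustration: that the later modification time wins,
that ties are broken by ordinary comparison of the sorted version vector
entries, and that the conflict tags are named "conflict" and
"conflicts_with". The dictionary layout of a version is likewise
hypothetical.

```python
def pre_resolve(v1, v2):
    """Pick a winner deterministically; return (winner, conflict_file)."""
    def key(version):
        # Modification time first, then the sorted version vector as tiebreak.
        return (version["mtime"], sorted(version["vector"].items()))

    winner, loser = (v1, v2) if key(v1) >= key(v2) else (v2, v1)
    conflict_file = {
        # Keep all of the loser's metadata and add the conflict tags.
        "attrs": {**loser["attrs"],
                  "conflict": "true",
                  "conflicts_with": winner["file_id"]},
        "data": loser["data"],
    }
    return winner, conflict_file
```

Because the choice depends only on metadata carried by the two versions,
any device evaluating the same pair reaches the same result, regardless
of the order in which it received them. Two genuinely conflicting
versions always have different version vectors, so their keys never
compare equal.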


Capacity management: Pushing updates to other de- 
vices can be problematic if those devices are at full ca- 
pacity. In this case, the full device will refuse subsequent 
updates, and mark the device file noting that the device 
is out of space. Until a user or tool corrects the problem, 
the device will continue to refuse updates, although other 
devices will be able to continue. However, management 
tools built on top of Perspective should help users ad- 
dress capacity problems before they arise. 


File deletion: As in many other distributed filesystems, 
when a file is removed, Perspective keeps a tombstone 
marker that assures all replicas of the file in the system 
are deleted, but is ignored by all naming operations. Per- 
spective uses a two-phase garbage collection mechanism, 
like that used in FICUS, between all devices with views 
that match the file to which the tombstone belongs. Note 
that deletion of a file removes all replicas of a file in the 
system, which is a different operation from dropping a 
particular replica of a file (done by manipulating views). 
This distinction also exists in any distributed filesystem 
allowing replication. 


View and device objects: Each device is only required 
to store view and device objects from devices that con- 
tain replicas of files it stores, although they must also 
temporarily store view and device files for devices in the 
current ensemble in order to access their files. Because 
views are very small (hundreds of bytes), this is tractable, 
even for small devices like cell phones. 


4.3 Reliability with partial replication 


In order to manage data semantically, users must be 
able to provide fault-tolerance on data split semanti- 
cally across a distributed set of disconnected, eventually- 
consistent devices. Perspective enables semantic fault- 
tolerance through two new algorithms, and provides a 
way to efficiently reason about the number of replicas of 
arbitrary sets of files. It also assures that data is never 
lost despite arbitrary and disconnected view manipula- 
tion using three simple distributed update rules. 


Reasoning about number of replicas: Reasoning about 
the reliability of a storage system — put simply, deter- 
mining the level of replication for each data item — is a 
challenge in a partially-replicated filesystem. Since de- 
vices can store arbitrary subsets of the data, there are no 
simple rules that allow all of the replicas to be counted. 
A naive solution would be to enumerate all of the files 
on each device and count replicas. Unfortunately, this 
would be prohibitively expensive and would be possible 
only if all devices are currently accessible. 


Fortunately, Perspective’s views compactly and fully de- 
scribe the location of files in terms of their attributes. 
Since there are far fewer views than there are file repli- 
cas in the system, it is cheaper to reason about the num- 
ber of times a particular query is replicated among all of 
the views in the system than to enumerate all replicas. 
The files in question could be replicated exactly (e.g., all 
of the family’s pictures are on two devices), they could 
be subsumed by multiple views (e.g., all files are on the 
desktop and all pictures are on the laptop), or they could 
be replicated in part on multiple devices but never in full 
on any one device (e.g., Alice’s pictures are on her laptop 
and desktop, while Bob’s pictures are on his laptop and 
desktop — among all devices, the entire family’s pictures 
have two replicas). 


To efficiently reason about how views overlap, Perspec- 
tive uses overlap trees. An overlap tree encapsulates the 
location of general subsets of data in the system, and 
thus simplifies the task of determining the location of the 
many data groupings needed by management tools. An 
overlap tree is currently created each time a management 
application starts, and then used throughout the applica- 
tion’s runtime to answer needed overlap queries. 


Overlap trees are created using the enumeration queries 
described in Section 4.1. Each node contains a query
that describes the set of data the node represents. Each
leaf node represents a subset of files whose location can
be precisely quantified using the views and records the devices
that store that subset. Each interior node of the tree
encodes a subdivision of the attribute space, and contains 
a list of child nodes, each of which represents a subset of 
the files that the parent node represents. We begin build- 
ing the tree by enumerating all of the attributes that are 
used in the views found in the household. 


We create a root node for the tree to represent all files, 
choose the first attribute in our attribute list, and use the 
enumerate values query to find all values of this attribute 
for the current node’s query. We then create a child 
node from each value with a query of the form <parent 
query> and attribute=value. We compare the query for 
each child node against the complete views on all de- 
vices. If the compare operator can give a precise answer 
(i.e., not unknown) for whether the query for this node
is stored on each device in the home, then this node is a 
leaf and we can stop dividing. Otherwise, we recursively 
perform this operation on the child node, dividing it by 
the next attribute in our list. Figure 2 shows an exam- 
ple overlap tree. The ordering of the attribute list could 
be optimized to improve performance of the overlap tree, 
but in the current implementation we leave it unordered. 
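

The construction just described can be sketched in a few lines of Python. This is an illustrative sketch only, not Perspective's implementation: it assumes that queries and views are simple conjunctions of attribute=value pairs stored as dictionaries, that a device stores a query if any single one of its views covers it, and that the helper names (compare, enumerate_values, build) are hypothetical. The example data reproduces the three-device scenario of Figure 2.

```python
from dataclasses import dataclass, field

ALL, NONE, UNKNOWN = "all", "none", "unknown"


def matches(file_attrs, query):
    """True if the file carries every attribute=value pair in the query."""
    return all(file_attrs.get(a) == v for a, v in query.items())


def enumerate_values(files, query, attr):
    """Stand-in for the enumerate values query (assumed in-memory file list)."""
    return sorted({f[attr] for f in files if attr in f and matches(f, query)})


def compare(query, view):
    """Does the view hold the files of the query? Uses the two queries only."""
    for attr, value in view.items():
        if attr in query and query[attr] != value:
            return NONE        # the two conjunctions contradict each other
    if all(attr in query for attr in view):
        return ALL             # the query is at least as specific as the view
    return UNKNOWN             # the view constrains an attribute left open


def device_status(query, views):
    """Combine the per-view answers for one device (simplifying assumption)."""
    results = [compare(query, v) for v in views]
    if ALL in results:
        return ALL
    if all(r == NONE for r in results):
        return NONE
    return UNKNOWN


@dataclass
class Node:
    query: dict
    stored: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def build(files, device_views, attrs, query=None, depth=0):
    """Split on the next attribute until every device gives a precise answer."""
    query = query or {}
    node = Node(query=dict(query))
    node.stored = {d: device_status(query, vs) for d, vs in device_views.items()}
    # Stop when the answer is precise, or when no attributes remain; in the
    # latter case the leaf keeps its "unknown" entries.
    if UNKNOWN not in node.stored.values() or depth == len(attrs):
        return node
    attr = attrs[depth]
    for value in enumerate_values(files, query, attr):
        child_query = {**query, attr: value}
        node.children.append(build(files, device_views, attrs, child_query, depth + 1))
    return node


def leaves(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


# The scenario of Figure 2 (file contents are made up for illustration).
FILES = [
    {"owner": "Brian", "type": "Music"},
    {"owner": "Brian", "type": "Document"},
    {"owner": "Mary", "type": "Music"},
    {"owner": "Mary", "type": "Document"},
]
VIEWS = {
    "Brian laptop": [{"owner": "Brian"}],
    "Mary laptop": [{"owner": "Mary"}],
    "Family desktop": [{"owner": "Mary", "type": "Music"}],
}
tree = build(FILES, VIEWS, ["owner", "type"])
```

In this sketch the owner=Brian node becomes a leaf immediately, while the owner=Mary node must be divided by type because the Family desktop view constrains an attribute that the node's query leaves open.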


When we create an overlap tree, we may not have all the 
information needed to construct the tree. For example, if 
we currently only have access to Brian’s files, we may in- 
correctly assume that all music files in the household are 
owned by Brian, when music files owned by Mary exist 
elsewhere in the system. The tree construction mecha- 
nism makes a notation in a node if it cannot guarantee 
that all matching files are available via the views. When 
checking for overlaps, if a marked node is required the 
tree will return an unknown value, but it will still cor- 
rectly compute overlaps for queries that do not require 
these nodes. To avoid this restriction, devices are free 
to cache and update an overlap tree, rather than recreat- 
ing the overlap tree when each management application 
starts. The tree is small, making caching it easy. To keep 
it up to date, a device can publish a view for all files, and 
then use the updates to keep the cached tree up to date. 


Once we have constructed the overlap tree, we can use 
it to determine the location and number of full copies in 
the system of the files for any given query. Because the 
tree caches much of the overlap processing, each indi- 
vidual query request can be processed efficiently. We 
do so by traversing all of the leaf nodes and finding 
those that overlap with the given view or query. We 
may occasionally need to perform more costly overlap 
detection, if the attribute in a leaf node does not match 
any of the attributes in the query. For example, in the 
overlap tree in Figure 2, if we were checking to see if 
the query album=Joshua Tree was contained in the node 
owner=Mary and type=Music we would use the enumer- 
ate values query to determine the values of “type” for the 
query album=Joshua Tree and owner=Mary. If “Mu- 
sic” is the only value, then we can count this node as a 
full match in our computations. Otherwise, we cannot. 
This extra comparison is only valid if we can determine 
via the views that all files in the query for which we are 
computing overlaps are accessible.
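

As a rough illustration of this lookup, the sketch below walks a hand-written list of leaf nodes (matching Figure 2) and reports which devices hold a full copy of a query. It is a simplified sketch under stated assumptions: queries are attribute=value dictionaries, the probe is modeled as an existence check against an in-memory file list rather than a real enumerate values query, and all names and file contents are hypothetical. The sketch probes conservatively whenever a leaf constrains an attribute the query does not mention.

```python
ALL, NONE = "all", "none"

# Leaf nodes of the Figure 2 tree: (leaf query, storage status per device).
LEAVES = [
    ({"owner": "Brian"},
     {"Brian laptop": ALL, "Mary laptop": NONE, "Family desktop": NONE}),
    ({"owner": "Mary", "type": "Music"},
     {"Brian laptop": NONE, "Mary laptop": ALL, "Family desktop": ALL}),
    ({"owner": "Mary", "type": "Document"},
     {"Brian laptop": NONE, "Mary laptop": ALL, "Family desktop": NONE}),
]

# Made-up files, used only by the probe step.
FILES = [
    {"owner": "Mary", "type": "Music", "album": "Joshua Tree"},
    {"owner": "Mary", "type": "Music", "album": "Abbey Road"},
    {"owner": "Mary", "type": "Document"},
    {"owner": "Brian", "type": "Music", "album": "Kind of Blue"},
]


def conflicts(query, leaf_query):
    """True if the two conjunctions disagree on a shared attribute."""
    return any(a in leaf_query and leaf_query[a] != v for a, v in query.items())


def probe(files, query, leaf_query):
    """Costly step: does any file satisfy both the query and the leaf?"""
    merged = {**leaf_query, **query}
    return any(all(f.get(a) == v for a, v in merged.items()) for f in files)


def full_copy_holders(query, leaves, files):
    """Return (devices holding every file of the query, number of probes)."""
    holders, probes = None, 0
    for leaf_query, stored in leaves:
        if conflicts(query, leaf_query):
            continue                      # leaf cannot contain query files
        if any(attr not in query for attr in leaf_query):
            probes += 1                   # query comparison alone is not enough
            if not probe(files, query, leaf_query):
                continue
        have = {d for d, status in stored.items() if status == ALL}
        holders = have if holders is None else holders & have
    return sorted(holders or ()), probes


joshua_tree = full_copy_holders({"album": "Joshua Tree"}, LEAVES, FILES)
marys_music = full_copy_holders({"owner": "Mary", "type": "Music"}, LEAVES, FILES)
```

The number of devices returned is the number of full copies of the query. A query that names the same attributes as the leaves (the second call) is answered from query comparisons alone, while the album query requires probes.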


Attributes with larger cardinalities can be handled more 
efficiently by selectively expanding the tree. For exam- 
ple, if a view is defined on date < T, we need only expand 
the attribute date into three sub-nodes, one for date < T, 
one for date ≥ T, and one for files with no date attribute.
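

A minimal sketch of this three-way split, assuming file attributes are held in a dictionary and dates are datetime.date values (the function name and bucket labels are hypothetical):

```python
from datetime import date


def date_bucket(file_attrs, threshold):
    """Assign a file to one of the three sub-nodes for a view on date < T."""
    value = file_attrs.get("date")
    if value is None:
        return "no date"
    return "date < T" if value < threshold else "date >= T"


T = date(2008, 1, 1)
buckets = [
    date_bucket({"date": date(2007, 6, 1)}, T),
    date_bucket({"date": date(2008, 1, 1)}, T),
    date_bucket({"type": "Music"}, T),
]
```

However many distinct dates exist in the system, the tree needs only these three children for the date attribute.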


Figure 2: Overlap tree. This figure shows an example overlap
tree, constructed from a three-device, three-view scenario: Brian’s files
stored on Brian’s laptop, Mary’s files stored on Mary’s laptop, and
Mary’s music stored on the Family desktop. Shaded nodes are interior
nodes; unshaded nodes are leaf nodes. Each leaf node lists whether
its query is stored on each device in the household.


Note that the number of attributes used in views at any
one time is likely to be much smaller than the total number
of attributes in the system, and both of these will be
much smaller than the total number of files or replicas.
For example, in our contextual analysis, most households
described settings requiring around 20 views and
5 attributes. None of the households we interviewed described
more than 30 views or more than 7 attributes.
Because the number of relevant attributes is small, over- 
lap tree computations are fast enough to allow us to com- 
pute them in real time as the user browses files. We will 
present a performance comparison of overlap trees to the 
naive approach in Section 6. 


Update rules: Perspective maintains permanence by 
guaranteeing that files will never be lost by changes to 
views or addition or removal of devices, regardless of 
the order, timing, or origin of the changes, freeing the 
user from worrying about these problems when making 
view changes. We also provide a guarantee that, once 
a version of a file is stored on the devices associated 
with all overlapping views, it will always be stored in 
all overlapping views, which provides a strong assurance 
on the number of copies in the system based on the cur- 
rent views. View freshness timestamps, as described in 
Section 4.2, allow Perspective to guarantee that all up- 
dates created before a given timestamp are safely stored 
in the correct locations, and thus have the fault-tolerance 
implied by the views. These guarantees are assured by 
careful adherence to three simple rules: 1) When a file 
replica is modified by a device, it is marked as “modified.”
Devices cannot evict modified replicas. Once a
modified replica has been pulled by a device holding a 
view covering it, the file can be marked as unmodified 
and then removed. 2) A newly created view cannot be 
considered complete until it has synced with all devices 
with overlapping views or synced with one device with 
a view that subsumes the new view. 3) When a view is 
removed, all replicas in it are marked as modified. The 
replicas are then removed when they conform to rule 1. 
These rules ensure that devices will not evict modified
replicas until they are safely on some “stable” location 
(i.e., in a completely created view). The rules also assure 
that a device will not drop a file until it has confirmed that 
another up-to-date replica of the file exists somewhere in 
the system. However, a user can force the system to drop 
a file replica without assuring another replica exists, if 
she is confident that another replica exists and is willing 
to forgo this system protection. With these rules, Per- 
spective can provide permanence guarantees without re- 
quiring central control or limiting when or where views 
can be changed. 
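

Rules 1 and 3 can be pictured as a small state machine over a per-replica “modified” flag. The sketch below is an illustration under simplifying assumptions, not Perspective's code: views are attribute=value dictionaries, every view is treated as already complete (so rule 2 is not modeled), and the Device class and its method names are hypothetical.

```python
class Device:
    def __init__(self, name, views=()):
        self.name = name
        self.views = [dict(v) for v in views]
        self.replicas = {}  # file id -> {"attrs": dict, "modified": bool}

    def covers(self, attrs):
        """True if some view on this device includes a file with these attributes."""
        return any(
            all(attrs.get(a) == v for a, v in view.items()) for view in self.views
        )

    def write(self, file_id, attrs):
        # Rule 1: a replica modified on this device is marked "modified".
        self.replicas[file_id] = {"attrs": dict(attrs), "modified": True}

    def pull_from(self, source, file_id):
        # Rule 1: once a device holding a covering view has pulled the replica,
        # the source copy may be marked unmodified.
        replica = source.replicas[file_id]
        if not self.covers(replica["attrs"]):
            return False
        self.replicas[file_id] = {"attrs": dict(replica["attrs"]), "modified": False}
        replica["modified"] = False
        return True

    def evict(self, file_id):
        # Rule 1: modified replicas cannot be evicted.
        if self.replicas[file_id]["modified"]:
            raise RuntimeError("cannot evict a modified replica")
        del self.replicas[file_id]

    def remove_view(self, view):
        # Rule 3: replicas in a removed view are marked modified, so they are
        # dropped only after another covering device has pulled them.
        self.views.remove(view)
        for replica in self.replicas.values():
            if all(replica["attrs"].get(a) == v for a, v in view.items()):
                replica["modified"] = True


camera = Device("camera")
desktop = Device("desktop", views=[{"owner": "Mary"}])
camera.write("photo1", {"owner": "Mary", "type": "Picture"})
pulled = desktop.pull_from(camera, "photo1")  # camera's copy becomes evictable
camera.evict("photo1")
```

Because the flag is cleared only by a pull from a covering device, the order in which views change or devices appear does not affect whether a file survives.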


4.4 Security and cross-household sharing 


Security is not a focus in this paper, but is certainly a 
concern for users and system designers alike. While 
Perspective does not currently support it, we envision 
using mechanisms such as those promoted by the UIA 
project [8]. Our current prototype supports voluntary ac- 
cess control using simplified access control lists. While 
all devices are able to communicate and share replicas 
with one another, even aside from security concerns it 
is helpful to divide households from one another to di- 
vide management and view specification. To do so, Per- 
spective maintains a household ID for each device and 
each file. Views are specified on files within the given 
household, to avoid undesired cross-syncing. However, 
the fundamental architecture of Perspective places no re- 
strictions on how these divisions are made. 


5 View manager interface 


To explore view-based management, we built a view 
manager tool to allow users to manipulate views. 


Customizable faceted metadata: One way of visual- 
izing and accessing semantic data is through the use of 
faceted metadata [43]. Faceted metadata allows a user 
to choose a first attribute to use to divide the data and 
a value at which to divide. Then, the user can choose 
another attribute to divide on, and so on. Faceted meta- 
data helps users browse semantic information by giving 
them the flexibility to divide the data as needed. But, it 
can present the user with a dizzying array of choices in 
environments with large numbers of attributes. 


To curb this problem, we developed customizable faceted 
metadata (CFM), which exposes a small user-selected 
set of attributes as directories plus one additional other 
groupings directory that contains a full list of possible at- 
tributes. The user can customize which attributes are dis- 
played in the original list by moving folders between the 
base directory and the other groupings directory. These 


preferences are saved in a customization object in the 
filesystem. The file structure on the left side of the in- 
terface in Figure 3 illustrates CFM. Perspective exposes 
CFM through the VFS layer, so it can be accessed in the 
same way as a normal hierarchical filesystem. 


View manager interface: The view manager interface
(Figure 3) allows users to create and delete views on
devices and to see the effects of these actions. This GUI 
is built in Java and makes calls into the view library of 
the underlying filesystem. 


The GUI is built on Expandable Grids [29], a user inter- 
face concept initially developed to allow users to view 
and edit file system permissions. Each row in the grid 
represents a file or file group, and each column repre- 
sents a device in the household. The color of a square 
represents whether the files in the row are stored on the 
device in the column. The files can be “all stored” on 
the device, “some stored” on the device, or “not stored” 
on the device. Each option is represented by a different 
color in the square. By clicking on a square a user can 
add or remove the given files from the given device. Sim- 
ilarly to file permissions, this allows users to manipulate 
actual storage decisions, instead of rule lists. 


An extra column, labeled “Summary of failure protec- 
tion,” shows whether the given set of files is protected 
from one failure or not, which is true if there are at least 
two copies of each file in the set. By clicking on an 
unbacked-up square, the user can ask the system to as- 
sure that two copies of the files are stored in the system, 
which it will do by placing any extra needed replicas on 
devices with free space. 


An extra row contains all unique views and where they 
are stored, allowing a user to see precisely what data is 
stored on each device at a glance. 


6 Evaluation 


Our experience from working with many home storage 
users suggests that users are very concerned about the 
time and effort spent managing their devices and data at 
home, which has motivated our design of Perspective, as 
well as our evaluation. Therefore, we focus our study pri- 
marily on the usability of Perspective’s management ca- 
pabilities and secondarily on its performance overhead. 


We conducted a lab study in which non-technical users 
used Perspective, outfitted with appropriate user inter- 
faces, to perform home data management tasks. We mea- 
sured accuracy and completion time of each task. In or- 
der to insulate our results as much as possible from the 
particulars of the user interface used for each primitive, 
we built similar user interfaces for each primitive using
the Expandable Grids UI toolkit [29].


[Figure 3 screenshot. Rows: file groups under “All files,” grouped by
owner (Brian, Family, Mary), with the Family files further grouped by
type (Movies, TV Shows, other groupings). Columns: Summary of failure
protection, Brian laptop, Family desktop, Mary laptop, TiVo. Legend:
each square shows whether none, some, or all of the files in the row
are stored on the device in the column; the summary column shows
“Not all protected from failure” or “All protected from one failure.”
Clicking a square adds or removes files from a device.]

Figure 3: View manager interface. A screen shot of the view manager
GUI. On the left are files, grouped using faceted metadata. Across
the top are devices. Each square shows whether the files in the row are
stored on the device in the column.


Views-facet interface: The views-facet interface was 
described in Section 5. It uses CFM to describe data, 
and allows users to place any set of data described by the 
faceted metadata on any device in the home. 


Volumes interface: This user interface represents a sim- 
ilar interface built on top of a more conventional volume- 
based system with directory hierarchies. Each device is 
classified as a client or server, and this distinction is listed 
in the column along with the device name. The volumes 
abstraction only allows permanent copies of data to be 
placed on servers, and it restricts server placement poli- 
cies on volume boundaries. We defined each root level 
directory (based on user) as a volume. The abstraction 
allows placement of a copy of any subtree of the data 
on any client device, but these replicas are only tempo- 
rary caches and are not guaranteed to be permanent or 
complete. The interface distinguishes between tempo- 
rary and permanent replicas by color. The legend dis- 
plays a summary of the rules for servers and permanent 
data and for clients and temporary data. 


Views-directory interface: To tease apart the effects of 
semantic naming and using a single replica class, we 
evaluated an intermediate interface, which replaces the 
CFM organization with a traditional directory hierarchy. 
Otherwise, it is identical to the views-facet interface. In
particular, it allows users to place any subtree of the hi- 
erarchy on any device. 


6.1 Experiment design 


Our user pool consisted of students and staff from nearby 
universities in non-technical fields who stated that they 
did not use their computers for programming. We did a 
between-group comparison, with each participant using 
one of the three interfaces described above. We tested 
10 users in each group, for a total of 30 users overall. 
The users performed a think-aloud study in which they 
spoke out loud about their current thoughts and read out 
loud any text they read on the screen, which provides in- 
sight into the difficulty of tasks and users’ interpretation. 
All tasks were performed in a Latin square configuration,
which guarantees that every task occurs in each position 
in the ordering, and each task is equally likely to follow 
any other task. 
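

One standard way to obtain both properties is a balanced Latin square (a Williams design). The paper does not say which construction was used, so the following is only a sketch of one possibility; the function name is hypothetical, and the construction shown applies to an even number of tasks, such as the 10 tasks used here.

```python
def balanced_latin_square(n):
    """Return n task orderings over tasks 0..n-1 (n must be even).

    Each task appears once in every position, and every ordered pair of
    distinct tasks appears as immediate neighbors exactly once.
    """
    if n % 2 != 0:
        raise ValueError("this construction requires an even number of tasks")
    # First ordering interleaves low and high task numbers: 0, 1, n-1, 2, n-2, ...
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    # Every other ordering shifts all task numbers by a constant, modulo n.
    return [[(task + shift) % n for task in first] for shift in range(n)]


orderings = balanced_latin_square(10)  # one ordering per participant in a group
```

With 10 tasks and 10 participants per group, each participant in a group could be assigned one row of the square.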


We created a filesystem with just over 3,000 files, based 
on observations from our contextual analysis. We cre- 
ated a setup with two users, Mary and Brian, and a 
third “Family” user with some shared files. We modeled 
Brian’s file layout on the Windows music and pictures 
tools and Mary’s on Apple’s iTunes and iPhoto file trees. 
Our setup included four devices: two laptops, a desk- 
top, and a DVR. We also provided the user with iTunes 
and iPhoto, with the libraries filled with all of the match- 
ing data from the filesystem. This allowed us to evaluate 
how users convert from structures in the applications to 
the underlying filesystem. 


6.2 Tasks 


Each participant performed the same set of tasks, which 
we designed based on our contextual analysis. We started 
each user with a 5 to 10 minute training task, after which 
our participants performed 10 data management tasks. 
While space constraints preclude us from including the full
text for all of them, as we discuss each class of tasks,
we include the text of one example task. For this study, 
we chose tasks to illustrate the differences between the 
approaches. A base-case task that was similar in all inter- 
faces confirmed that, on these similar tasks, all interfaces 
performed similarly. The tasks were divided into two 
types: single replica tasks, and data organization tasks. 


Single replica tasks: Two single replica tasks (LH and 
CB) required the user to deal with distinctions between 
permanent and temporary replicas to be successful. 


Example task, Mary’s laptop comes home (LH): “Mary 
has not taken her laptop on a trip with her for a while 
now, so she has decided to leave it in the house and make 
an extra copy of her files on it, in case the Family desktop
fails. However, Brian has asked her not to make extra 
copies of his files or of the Family files. Make sure Mary’s 
files are safely stored on her laptop.” 


Mary’s laptop was initially a client in the volume case. 
This task asked the user to change it to a server before 
storing data there. This step was not required for the sin- 
gle replica class interfaces, as all devices are equivalent. 


Note that because server/client systems, unlike Perspec- 
tive, are designed around non-portable servers for sim- 
plicity, it is not feasible to simply make all devices 
servers. Indeed, the volume interface actually makes 
this task much simpler than current systems; in the vol- 
ume interface, we allow the user to switch a device from 
server to client using a single menu option, where current 
distributed filesystems require an offline device reformat. 


Data organization tasks: The data organization tasks 
required users to convert from structures in the iTunes 
and iPhoto applications into the appropriate structures in 
the filesystem. This allowed us to test the differences between
hierarchical and semantic, faceted systems. The
data organization tasks are divided into three types: ag- 
gregation, comprehension, and sparse collection tasks. 


Aggregation: One major difference between semantic 
and hierarchical systems is that because the hierarchy 
forces a single tree, tasks that do not match the current 
tree require the user to aggregate data from multiple di- 
rectories. This is a natural case as homes fill with ag- 
gregation devices and data is shared across users and de- 
vices. However, in a hierarchical system, it is difficult for 
users to know all the folders that correspond to a given 
application grouping. Users often erroneously assumed 
all the files for a given collection were in the same folder. 
The semantic structure mitigates this problem, since the 
user is free to use a filesystem grouping suited to the cur- 
rent specific task. 


Example task, U2 (U2): “Mary and Brian share music 
at home. However, when Mary is on trips, she finds that 
she can’t listen to all the songs by U2 on her laptop. She 
doesn’t listen to any other music and doesn’t want other 
songs taking up space on her laptop, but she does want 
to be able to listen to U2. Make sure she can listen to all 
music by the artist U2 on her trips.” 


As may often be the case in the home, the U2 files were 
spread across all three users’ trees in the hierarchical
interfaces. The user needed to use iTunes to locate the
various folders. The semantic system allowed the user to
view all U2 files in a single grouping. 


Aggregation is also needed when applications sort data 
differently from what is needed for the current task. For 
example, iPhoto places modified photos in a separate 


folder tree from originals, making it tricky for users to 
get all the files for a particular event. The semantic struc- 
ture allows applications to set and use attributes, while 
allowing the user to group data as desired. 


Example task, Rafting (RF): “Mary and Brian went on 
a rafting trip and took a number of photos, which Mary 
remembers they labeled as ‘Rafting 2007’. She wants to 
show her mother these photos on Mary’s laptop. How- 
ever, she doesn’t want to take up space on her laptop for 
files other than the ‘Rafting 2007’ files. Make sure Mary 
can show the photos to her mother during her visit.” 


The rafting photos were initially in Brian’s files, but 
iPhoto places modified copies of photos in a separate di- 
rectory in the iPhoto tree. To find both folders, the user 
needed to explore the group in iPhoto. The semantic sys- 
tem allows iPhoto to make the distinction, while allowing 
the user to group all files from this roll in one grouping. 


Comprehension: Applications can allow users to set 
policies on application groupings, and then convert them 
into the underlying hierarchy. However, in addition to
requiring multiple implementations and methods for the
same system tasks, this leads to extremely messy underlying
policies that are difficult for users to understand,
especially when viewed from another application.
In contrast, semantic systems can retain a description
of the policy as specified by the application, making
it easier for users to understand.


Example task, Traveling Brian (TB): “Brian is taking a 
trip with his laptop. What data will he be able to access 
while on his trip? You should summarize your answer 
into two classes of data.” 


Brian’s laptop contained all of his files and all of the 
music files in the household. However, because iTunes 
places TV shows in the Music repository, the settings 
included all of the music subfolders, but not the “TV 
Shows” subfolder, causing confusion. In contrast, the 
semantic system allows the user to specify both of these 
policies in a single view, while still allowing applications 
to sort the data as needed. 


Note that this particular task would be simpler if iTunes 
chose to sort its files differently, but the current iTunes 
organization is critical for other administrative tasks, 
such as backing up a user’s full iTunes library. It is im- 
possible to find a single hierarchical grouping that will 
be suited to all needed operations. This task illustrates 
how these kinds of mismatches occur even for common 
tasks and well-behaved applications. 


Sparse collection: Two sparse collection tasks (BF and 
HV) required users to make policies on collections that 
contain single files from across the tree, such as song 
playlists. These structures do not lend themselves well 
to a hierarchical structure, so they are kept externally
in application structures, forcing users to re-create these 
policies by hand. In contrast, semantic structures allow 
applications to push these groupings into the filesystem. 


Example task, Brian favorites (BF): “Brian is taking a 
trip with his laptop. He doesn’t want to copy all music 
onto his laptop as he is short on space, but he wants to
have all of the songs on the playlist ‘Brian favorites’.”


Because the playlist does not exist in the hierarchy, the 
user had to add the nine files in the playlist individually, 
after looking up the locations using iTunes. In the se- 
mantic system, the playlist is included as a tag, allowing 
the user to specify the policy in a single step. 


6.3 Results 


All of the statistically significant comparisons are in fa- 
vor of the facet interface over the alternative approaches, 
showing the clear advantage of semantic management for 
these tasks. For the single replica tasks the facet and di- 
rectory interfaces perform comparably, as expected, with 
an average accuracy of 95% and 100% respectively, com- 
pared to an average of 15% for the volume interface. 
For the data organization tasks, the facet interface out- 
performs the directory and volume interfaces with an av- 
erage accuracy of 66% compared to 14% and 6% respec- 
tively. Finally, while the accuracy of sparse tasks is not 
significantly different, the average time for completion 
for the facet interface is 73 seconds, compared to 428 
seconds for the directory interface and 559 seconds for 
the volume interface. We discuss our statistical compar- 
isons and the tasks in more detail in this section. 


Statistical analysis: We performed a statistical analysis
on our accuracy results in order to test the strength
of our findings. Because our data was not well-fitted to
the chi-squared test, we used a one-sided Fisher’s Exact
Test for accuracy and a t-test to compare times. We
used Benjamini-Hochberg correction to adjust our p values
to correct for our use of multiple comparisons. As is
conventional in HCI studies, we used α = .05. All the
comparisons mentioned in this section were statistically 
significant, except where explicitly mentioned. 
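

For readers unfamiliar with these procedures, the sketch below implements a one-sided Fisher's exact test for a 2x2 table and the Benjamini-Hochberg adjustment using only the Python standard library. It is not the analysis code used in the study, and the counts and p values in the example are invented purely for illustration; they are not the study's data.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(group 1 has >= a successes) under the null, for the 2x2 table
    [[a, b], [c, d]] = [[correct_1, incorrect_1], [correct_2, incorrect_2]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1))
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 9 of 10 users correct in one group, 2 of 10 in another.
p_example = fisher_one_sided(9, 1, 2, 8)
# Hypothetical raw p values from four comparisons.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p < 0.05 for p in adjusted]
```

A comparison is then reported as significant when its adjusted p value falls below α = .05.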


Single replica tasks: Figure 4 shows results from the 
single replica tasks. As expected, the directory and view 
interfaces, which both have a single replica class, per- 
form equivalently, while the volume interface suffers 
heavily due to the extra complexity of two distinct replica 
classes. The comparisons between the single replica in- 
terfaces and the volume interface are all statistically sig- 
nificant. We do not show times, because they showed no 
appreciable differences. 


Figure 4: Single replica task results. This graph shows the results of
the single replica tasks.


Figure 5: Data organization task results. This graph shows the
results from the aggregation and comprehension tasks.


Data organization tasks: Results from the three aggregation
tasks (U2, RF, and TV) and the two comprehension
tasks (TB and TM) are shown in Figure 5. As
expected, the faceted metadata approach performs significantly
better than the alternative approaches, as the
filesystem structure more closely matches that of the
applications. The facet interface is significantly better than
both of the other interfaces in the aggregation tasks, but we
would need more data to show statistical significance for the
comprehension tasks.


Figure 6: Sparse collection task results. This graph shows the results
from all of the sparse collection tasks.


Figure 6 shows the accuracy and time metrics for the
sparse tasks (BF and HV). Note that none of the accuracy
comparisons are statistically significant. This is because
in the sparse tasks, each file is in a unique location,
making the correlation between application structure and
filesystem structure clear, but very inconvenient. In contrast,
for the aggregation tasks the correlation between
application structures and the filesystem was
hazy, leading to errors. However, setting the policy on
each individual file was extremely time consuming, leading
to a statistically significant difference in times. The
one exception is the HV task, where too few volume
users correctly performed the task to allow comparison
with the other interfaces. Indeed, the hierarchical interfaces
took an order of magnitude longer than the facet interface
for these tasks. Thus re-creating the groups was
difficult, leading to frustration and frequent grumbling
that “there must be a better way to do this.”


6.4 Performance evaluation 


We have found that Perspective generally incurs neg- 
ligible overhead over the base filesystem, and its per- 
formance is sufficient for everyday use. Using overlap 
trees to reason about the location of files based on the 
available views is a significant improvement over sim- 
pler schemes. All our tests were run on a MacBook Pro 
2.5GHz Intel Core Duo with 2GB RAM running Macin- 
tosh OS X 10.5.4. 


Performance overhead: Our benchmark writes 200
4MB files, clears the cache by writing a large amount
of data elsewhere, and then re-reads all 800MB. This
sequential workload on small files simulates common
media workloads. For these tasks, we compared Perspective
to HFS+, the standard OS X filesystem. Writing
the files on HFS+ and Perspective took 18.1s and 18.6s,
respectively. Reading them took 17.0s and 17.2s, respectively.
Perspective has less than 3% overhead in
both phases of this benchmark. In a more real-world scenario,
Perspective has been used by the authors for several
months as the backing store for several multi-tuner
DVRs, without performance problems.
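

The overhead figure follows directly from the reported times. The short calculation below uses only the four numbers given above; the function name is hypothetical.

```python
def overhead_percent(baseline_s, measured_s):
    """Relative slowdown of the measured time over the baseline, in percent."""
    return 100.0 * (measured_s - baseline_s) / baseline_s


write_overhead = overhead_percent(18.1, 18.6)  # HFS+ vs. Perspective, write phase
read_overhead = overhead_percent(17.0, 17.2)   # HFS+ vs. Perspective, read phase
```

The write phase comes to roughly 2.8% and the read phase to roughly 1.2%, both under the 3% bound stated in the text.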


Overlap trees: Overlap trees allow us to efficiently com- 
pute how many copies of a given file set are stored in 
the system, despite the more flexible storage policies that 
views provide. It is important to make this operation ef- 
ficient because, while it is only used in administration 
tasks, these tasks require calculation of a large number 
of these overlaps in real time as the user browses and 
manipulates data placement policies. 


Figure 7 summarizes the benefits of overlap trees. We
compared overlap trees to a simple method that enumerates
all matching files and compares them against the
views in the system. We break out the cost for tree
creation and then the cost to compute an overlap. The
“probe” case uses a query and view set that requires the
overlap tree to probe the filesystem to compute the overlap,
while the “no probe” case can be determined solely
through query comparisons. Overlap trees take a task
that would require seconds or minutes and turn it into a
task requiring milliseconds.


Num files | Create OT | OT no probe | OT w/ probe | Simple
100       | 9.6ms     | 0.3ms       | 3.5ms       | 961ms (.9sec)
1000      | 29ms      | 0.6ms       | 3.8ms       | 12759ms (12sec)
10000     | 249ms     | 0.6ms       | 3.4ms       | 95049ms (95sec)

Figure 7: Overlap tree benchmark. This table shows the results
from the overlap tree benchmark. It compares the time to create a tree
and perform an overlap comparison, with or without probes, and compares
to the simple enumerate approach. The results are the average of
10 runs.


7 Related work 


A primary contribution of Perspective is the use of se- 
mantic queries to manage the replication of data. Specif- 
ically, it allows the system to provide accessibility and 
reliability guarantees over semantic, partially replicated 
data. This builds on previous semantic systems that used 
queries to locate data, and hierarchies to manage data. 
Our user study evaluation shows that, by supporting se- 
mantic management, Perspective can simplify important 
management tasks for end users. 


Another contribution is a filesystem design based on in- 
situ analysis of the home environment. This overall de- 
sign could be implemented on top of a variety of under- 
lying filesystem implementations, but we believe that a 
fully view-based system provides simplicity to both user 
and designer by keeping the primitives similar through- 
out the system. While no current system provides all 
of the features of Perspective, Perspective builds on a 
wealth of previous work in data placement, consistency, 
search and publish/subscribe event notification. In this 
section we discuss this related work. 


Data placement: Views allow flexible data placement 
used to provide both reliability and mobility. Views are 
another step in a long progression of increasingly flexible 
data placement schemes. 


The most basic approach to storing data in the home is to 
put all of the data on a single server and make all other 
devices in the home clients of this server. Variations of 
this approach centralize control, while allowing data to 
be cached on devices [18, 35]. 


To provide better reliability, AFS [34] expanded the sin- 
gle server model to include a tier of replicated servers, 
each connected in a peer-to-peer fashion. However,
clients cannot access data when they are out of contact 
with the servers. Coda [32] addressed this problem by 
allowing devices to enter a disconnected mode, in which 
devices use locally cached data defined by user hoard- 
ing priorities. However, hoarded replicas do not provide 
the reliability guarantees allowed by volumes because 
devices make no guarantee about what data resides on 
what devices, or how long they will keep the data they 
currently store. Views extend this notion by allowing 
volume-style reliability guarantees along with the flex- 
ibility of hoarding in the same abstraction. 


A few filesystems suggested even more flexible methods 
of organizing data. BlueFS extended the hoarding prim- 
itive to allow client devices to access data hoarded on 
portable storage devices, in addition to the local device, 
but did not explore the use of this primitive for accessi- 
bility or reliability beyond that provided by Coda [21]. 
Footloose [24] proposed allowing individual devices to 
register for data types in this kind of system as an alter- 
native to hoarding files, but did not expand it to general 
publish/subscribe-style queries, or explore how to use 
this primitive for mobility and reliability management or 
for distributed search. 


Consistency: Perspective supports decentralized, 
topology-independent consistency for semantically- 
defined, partially replicated data, a critical feature for 
the home environment. While no previous system 
provides these properties out of the box, PRACTI [7] 
also provides a framework for topology-independent 
consistency of partially replicated data over directo- 
ries, in addition to allowing a group of sophisticated 
consistency guarantees. PRACTI could probably be 
extended to use semantic groupings fairly simply, and 
thus provide consistency properties like Perspective. 
Recently, Cimbiosis [28] has also built on a view-style 
system of partial replication and topology independence, 
with a different consistency model. 


Cimbiosis also presents a sync tree which provides a dis- 
tributed algorithm to ensure connectedness, and routes 
updates in a more flexible manner. This sync tree could 
be layered on top of Perspective or PRACTI’s consistency
mechanisms to provide these advantages. 


We chose our approach over Cimbiosis because it does 
not require any device to store all files, while Cimbio- 
sis has this requirement. Many of the households in our 
contextual analysis did not have any such master device, 
leading us to believe requiring it could be a problem. 
Perspective also does not require small devices to track 
any information about the data stored on other devices, 
while PRACTI requires them to store imprecise sum- 
maries. However, there are advantages to each of these 
approaches as well. For example, PRACTI provides a
more flexible consistency model than Perspective, and 
Cimbiosis a more compact log structure. A full compari- 
son of the differences between these approaches, and the 
relative importance of these differences, is beyond the 
scope of this paper. We present Perspective’s algorithms 
to show that it is possible to build a simple, efficient con- 
sistency protocol for a view-based system. 


Previous peer-to-peer systems such as Bayou [40], FI- 
CUS [16] and Pangaea [30] extended synchronization 
and consistency algorithms to accommodate mobile de- 
vices, allowing these systems to blur or eliminate the dis- 
tinction between server and client. However, none of 
these systems fully support topology-independent con- 
sistency with partial replication. EnsemBlue [25] takes 
a middle ground, providing support for groups of client 
devices to form device ensembles [33], which can share 
data separately from a server through the creation of a 
temporary pseudo-server, but requiring a central server 
for consistency and reliability. 


Search: We believe that effective home data manage- 
ment will use search on data attributes to allow flexi- 
ble access to data across heterogeneous devices. Per- 
spective takes the naming techniques of semantic sys- 
tems and applies them to the replica management tasks 
of mobility and reliability as well. Naturally, Perspective 
borrows its semantic naming structures and search tech- 
niques from a rich history of previous work. The Seman- 
tic Filesystem [12] proposed the use of attribute queries 
to locate data in a file system, and subsequent systems 
showed how these techniques could be extended to in- 
clude personalization [14]. Flamenco [43] uses “faceted 
metadata,” a scheme much like the semantic filesystem’s. 
Many newer systems [3, 13, 19, 37] borrow from the 
Semantic Filesystem by adding semantic information to 
filesystems with traditional hierarchical naming. Mi- 
crosoft’s proposed WinFS filesystem also incorporated 
semantic naming [42]. 


Perspective also uses views to provide efficient dis- 
tributed search, by guiding searches to appropriate de- 
vices. The most similar work is HomeViews [11], which 
uses a primitive similar to Perspective’s views to allow 
users to share read-only data. HomeViews combines ca- 
pabilities with persistent queries to provide an extended 
version of search over data, but does not use them to target
replica management tasks like reliability. 


Replica indices and publish/subscribe: In order to pro- 
vide replica coherence and remote data access, filesys- 
tems need a replica indexing system that forwards up- 
dates to the correct file replicas and locates the replicas 
of a given file when it is accessed remotely. Previous sys- 
tems have used volumes to index replicas [32, 34], but 
did not support replica indexing in a partially replicated
peer-ensemble. EnsemBlue [25] extended the volume 
model to support partially replicated peer-ensembles by 
allowing devices to store a single copy of all replica lo- 
cations onto a temporarily elected pseudo-server device. 
EnsemBlue also showed how its replica indexing system 
could be leveraged to provide more general application- 
level event notification. Perspective takes an inverse ap- 
proach; it uses a publish/subscribe model to implement 
replica indexing and, thus, application-level event notifi- 
cation. This matches the semantic nature of views. 


This work does not propose algorithms beyond the current
publish/subscribe literature [1, 4, 6, 26, 38]; rather, it applies
publish/subscribe algorithms to the new area of
file system replica indices. Using a publish/subscribe
method for replica indexing provides advantages over a 
pseudo-server scheme, such as efficient ensemble cre- 
ation, but also disadvantages, such as requiring view 
changes to move replicas. Again, a full comparison of 
alternative approaches is beyond the scope of the paper. 
We present Perspective’s algorithms to show that replica 
indexing can be performed efficiently using views. 


User studies: While we believe our contextual analy- 
sis is the first focused on home data organization and 
reliability, researchers have conducted a wealth of stud- 
ies on technology use and management, especially in the 
home [2, 5, 9, 15, 17, 20, 22, 39]. We borrow our meth- 
ods from these previous studies, and use them to ground 
our exploration and analysis. 


8 Conclusion 


Home users struggle with replica management tasks that 
are normally handled by professional administrators in 
other environments. Perspective provides distributed 
storage for the home with a new approach to data loca- 
tion management: the view. Views simplify replica man- 
agement tasks for home storage users, allowing them to 
use the same attribute-based naming style for such tasks 
as for their regular data navigation. 
Abstract


This paper presents the design, implementation, and 
evaluation of BORG, a self-optimizing storage system 
that performs automatic block reorganization based on 
the observed I/O workload. BORG is motivated by three 
characteristics of I/O workloads: non-uniform access 
frequency distribution, temporal locality, and partial de- 
terminism in non-sequential accesses. To achieve its ob- 
jective, BORG manages a small, dedicated partition on 
the disk drive, with the goal of servicing a majority of the 
I/O requests from within this partition with significantly 
reduced seek and rotational delays. BORG is transparent 
to the rest of the storage stack, including applications, file 
system(s), and I/O schedulers, thereby requiring no or 
minimal modification to storage stack implementations. 
We evaluated a Linux implementation of BORG using 
several real-world workloads, including individual user 
desktop environments, a web-server, a virtual machine 
monitor, and an SVN server. These experiments compre- 
hensively demonstrate BORG’s effectiveness in improv- 
ing I/O performance and its incurred resource overhead. 


1 Introduction 


There is a continual increase in the gap between CPU 
performance and disk drive performance. While the 
steady increase in main memory sizes attempts to bridge 
this gap, the impact is relatively small; Patterson et 
al. [25] have pointed out that disk drive capacities and 
workload working-set sizes tend to grow at a faster rate 
than memory sizes. Present day file systems, which con- 
trol space allocation on the disk drive, employ static data 
layouts [5, 8, 15, 20, 22, 37]. Mostly, they aim to pre- 
serve the directory structure of the file system and opti- 
mize for sequential access to entire files. No file system 
today takes into account the dynamic characteristics of 
I/O workload within its data management mechanisms. 
We conducted experiments to reconcile past observa- 
tions about the nature of I/O workloads [7, 9, 30] in the 
context of current-day systems including end-user and 
server-class systems. Our key observations that motivate 
BORG are: (i) on-disk data exhibit a non-uniform access
frequency distribution: the “frequently accessed” data is
usually a small fraction of the total data stored when considering
a coarse-granularity time-frame, (ii) considering


*The first three authors contributed equally to this work. 


a fine-granularity time-frame, the “on-disk working-set” 
of typical I/O workloads is dynamic; nevertheless, work- 
loads exhibit temporal locality in the data that they ac- 
cess, and (iii) I/O workloads exhibit partial determin- 
ism in their disk access patterns; besides sequential ac- 
cesses to portions of files, fragments of the block access 
sequence that lead to non-sequential disk accesses also 
repeat. We elaborate on these observations in § 2. 

While the above observations mostly validate the prior 
studies, and may even appear largely intuitive, surpris- 
ingly, there is a lack of commodity storage systems that 
utilize these observations to reduce I/O times. We believe 
that such systems do not exist because (i) key design and 
implementation issues related to the feasibility of such 
systems have not been resolved, and (ii) the scope of ef- 
fectiveness of such systems has not been determined. 

We built BORG, an online Block-reORGanizing stor- 
age system to comprehensively address the above issues. 
BORG correlates disk blocks based on block access pat- 
terns to capture the I/O workload characteristics. It 
manages a dedicated, BORG OPtimized Target (BOPT) 
partition and dynamically copies working-set data blocks 
(possibly spread over the entire disk) in their relative ac- 
cess sequence contiguously within this partition, thus si- 
multaneously reducing seek and rotational delays. In ad- 
dition, it assimilates all write requests into the BOPT par- 
tition’s write buffer. Since BORG operates in the back- 
ground it presents little interference to foreground appli- 
cations. Also, BORG provides strong block-layer data 
consistency to upper layers, by maintaining a persistent 
page-level indirection map. 

We evaluated a Linux implementation of BORG for 
a variety of workloads including a development work- 
station, an SVN server, a web server, a virtual machine 
monitor, as well as several individual desktop applica- 
tions. The evaluation shows both the benefits and short- 
comings of BORG as well as its resource overheads. 
Particularly, BORG can degrade performance when a 
non-sequential read workload suddenly shifts its on-disk 
working-set. For most workloads, however, BORG decreased
disk busy times by 6% to 50%, offering
the greatest benefit in the case of non-sequential write-
mostly workloads without tuning BORG parameters for
optimality. A sensitivity study with various parameters
of BORG demonstrates the importance of careful parameter
choice, which can lead to even greater improvements
or degrade performance in the worst case; a self-configuring
BORG is certainly a logical and feasible direction.
Memory overheads of BORG are bounded within
0.25% of BOPT, but CPU overheads are higher. Fortunately,
most processing can be done in the background
and there is ample room for improvement.


Table 1: Summary statistics of week-long traces obtained from four different systems.
Columns: Workload (office, developer, SVN server, web server); File System; Memory; Reads [GB] and Writes [GB] (Total, Unique); File System accessed; Top 20% data access; Partial determinism.

This paper makes the following contributions: (i) we
study the characteristics of I/O workloads and show how
the findings motivate BORG (§ 2), (ii) we motivate and
present the detailed design and the first implementation
of a disk data re-organizing system that adapts itself to
changes in the I/O workload (§ 3 and § 4), (iii) we present
the challenges faced in building such a system and our
solutions to them (§ 5), and (iv) we evaluate the system to
quantify its merits and weaknesses (§ 6).


2 Characteristics of I/O Workloads 


In this section, we investigate the characteristics of mod- 
ern I/O workloads, specifically elaborating on those that 
directly motivate BORG. We collected I/O traces, down- 
stream of an active page cache, over a one-week pe- 
riod from four different machines. These machines have 
different I/O workloads, including office and developer 
desktop workloads, a version control SVN (Subversion) 
server, and a web-server. The office and developer 
workloads are single-user workloads. The former work- 
load was composed mostly of web-browsing, graph plot- 
ting with gnuplot, and several open-office applications, 
while the latter consisted of extensive development us- 
ing emacs, gcc, and gdb, document preparation using 
LaTeX, email, web-browsing, and updates of the operating
system. The SVN server hosted document and
project code-base repositories for our 6-person research 
group. Finally, the web-server workload mirrored the 
web-requests made to our department’s production web- 
server on one of our lab machines and served 1.1 million 
web requests during the trace period. Key statistics for 
these workloads are summarized in Table 1. We define 
the on-disk working-set (henceforth also referred to sim- 
ply as “working-set”) of an I/O workload as the set of all 
unique blocks accessed in a given interval. 


2.1 Non-uniform Access Frequency Distribution


Researchers have pointed out that file system data have 
non-uniform access frequency distribution [2, 29, 39]. 
This was confirmed in the traces that we collected, where
only 4.5-22.3% of the file system data were accessed
over the duration of an entire week (shown in Table 1).
We observe that the office and web server workloads are
read mostly, while the developer and SVN server are
write mostly. Figure 1 (top row) shows page access rank-
frequency plots for the workloads; file system pages were
4KB in size, composed of 8 contiguous blocks. A uniform
trend to be observed across the various workloads
is that the really high frequency accesses are due to write
requests. However, and especially in the case of the read-
mostly office and web server workloads, there are a large
number of read requests that occur repeatedly. In either
case (read or write), the access frequencies are highly
skewed. Figure 1 (middle row) depicts disk heatmaps
created by partitioning the disk into regions and measuring
accesses to each region. The heatmaps indicate
that accesses, both high and low frequency ones, in most
cases are spread over the entire disk area. Skewed data
access frequency is further illustrated in Table 1: the
top 20% most frequently accessed blocks contributed
a substantially large (~45-66%) percentage of the total accesses
across the workloads, which is within the ranges
reported by Gómez and Santonja (Figure 2(a) in [7]) for
the Cello traces they examined.

Based on the above observations, it is reasonable to ex- 
pect that co-locating frequently accessed data in a small 
area of the disk would help reduce seek times when com- 
pared to the same data being spread throughout the entire 
disk area. Akyurek and Salem [2] have demonstrated the 
performance benefits of such an optimization via a sim- 
ulation study. This observation also motivates reorganiz- 
ing copies of popular blocks in BORG. 
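
To make the skew statistic concrete, the following minimal Python sketch computes the share of all accesses that go to the most frequently accessed blocks. It is only an illustration: the function name `top_fraction_share`, the trace format (a flat list of block addresses), and the sample data are assumptions, not part of the BORG implementation or of the traces reported in Table 1.

```python
from collections import Counter

def top_fraction_share(accessed_blocks, fraction=0.2):
    """Share of all accesses that go to the top `fraction` of unique blocks."""
    counts = Counter(accessed_blocks)
    if not counts:
        return 0.0
    # Rank-frequency order: most frequently accessed blocks first.
    ranked = sorted(counts.values(), reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

# Hypothetical trace with 10 unique blocks and 24 accesses in total.
trace = [7] * 12 + [3] * 4 + [1, 2, 4, 5, 6, 8, 9, 10]
share = top_fraction_share(trace)   # top 2 blocks receive 16 of 24 accesses
```

A share well above the chosen fraction (here two thirds of the accesses for 20% of the blocks) indicates the kind of skew that makes co-locating popular blocks worthwhile.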


2.2 Temporal Locality 


Temporal locality in I/O workloads is observed when the 
on-disk working-sets remain mostly static over short du- 
rations. Here, we refer to a locality of hours, days, or 
weeks, rather than seconds or minutes (typical of main 
memory accesses). For instance, a developer may work 
on a few projects over a period of a few weeks or months, 
typically resulting in her daily or weekly working sets 
being substantially smaller than her entire disk size. In
servers, popularity of client requests results in temporal
locality. A web server’s top-level links tend to be accessed
more frequently than content that is embedded
much deeper in the web-site; an important new revision
of a specific repository on an SVN server is likely to be
accessed repeatedly over the initial weeks.


Figure 1: Rank-frequency, heatmap, and working-set plots for week-long traces from four different systems.
The heatmaps (middle row) depict frequency of accesses in various physical regions of the disk, each cell representing
a region. Six normalized, exponentially-increasing heat levels are used in each heatmap where darker cells represent
higher frequency of accesses to the region. Disk regions are mapped to cells in row-major order.

Figure 1 (bottom row) depicts the changes in the per-day
working-sets of the I/O workload. The two end-user
I/O workloads and the web server workload exhibit large 
overlaps in the data accessed across successive days of 
the week-long trace with the first day of the trace. There 
is substantial overlap even among the top 20% most ac- 
cessed data across successive days. Interestingly, these 
workloads do not necessarily exhibit a gradual decay in 
working-set overlap with day 1 as one might expect, in- 
dicating that popularity is consistent across multi-day pe- 
riods. The SVN server exhibits anomalous behavior be- 
cause periods of high commit activity degrade temporal 
locality (new data gets created), while periods of high 
update activity improve temporal locality. 

These observations indicate that optimizing layout
based on past I/O activity can improve future I/O performance
for some workloads, and they motivate planning block
reorganization based on past activity in BORG.
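
One way to quantify this kind of temporal locality is sketched below in Python. We assume, for illustration only, that the overlap for a given day is the fraction of that day's working-set that was also accessed on day 1; the exact metric behind the working-set plots may differ, and the name `working_set_overlap` and the sample data are hypothetical.

```python
def working_set_overlap(daily_blocks):
    """For each day, the fraction of its working-set also accessed on day 1."""
    day1 = set(daily_blocks[0])
    overlaps = []
    for blocks in daily_blocks:
        working_set = set(blocks)   # unique blocks accessed during the day
        if working_set:
            overlaps.append(len(working_set & day1) / len(working_set))
        else:
            overlaps.append(0.0)
    return overlaps

# Hypothetical three-day trace: one list of accessed block addresses per day.
days = [[1, 2, 3, 4], [2, 3, 4, 5], [1, 2, 9, 9]]
overlaps = working_set_overlap(days)
```

A sequence of overlaps that stays high rather than decaying day by day corresponds to the consistent multi-day popularity described above.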


2.3 Partial Determinism 


Partial determinism in I/O workload occurs when certain 
non-sequential accesses in the block access sequence are 
found to repeat. A non-sequential access is defined by a
sequence of two I/O operations that address non-contiguous
block addresses. Partial determinism manifests in both end-
user systems and servers. For instance, I/O during application
start-up is largely deterministic, both in terms of
the set of I/O requests and the sequence in which they
are requested. Reading files related to a repeatable task
such as setting up a project in an integrated development
environment, compilation, linking, word-processing, etc.
results in a deterministic I/O pattern. In a web-server, accessing
a web-page involves accessing associated sub-pages,
images, scripts, etc., in deterministic order.

In Table 1, we present the partial determinism for each 
workload calculated as the percentage of non-sequential 
accesses that repeat at least once during the week. The 
partial determinism percentages are high for the two end- 
user and the SVN server workloads. Further, for each of 
these workloads, a non-trivial number of non-sequential
accesses repeated as many as 100 times.
These findings suggest that there is ample scope for op- 
timizing the repeated non-sequential access patterns. 
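
The partial determinism metric can be sketched as follows. This Python fragment is an illustration under stated assumptions: requests are (start LBA, size) pairs, a transition is non-sequential when the second request does not start where the first one ends, and every occurrence of a transition that appears more than once counts as repeated. The function name and the sample trace are hypothetical.

```python
from collections import Counter

def partial_determinism(requests):
    """Percentage of non-sequential accesses that repeat at least once."""
    transitions = Counter()
    for (lba_a, size_a), (lba_b, size_b) in zip(requests, requests[1:]):
        if lba_b != lba_a + size_a:   # non-contiguous, hence non-sequential
            transitions[((lba_a, size_a), (lba_b, size_b))] += 1
    total = sum(transitions.values())
    if total == 0:
        return 0.0
    repeated = sum(count for count in transitions.values() if count > 1)
    return 100.0 * repeated / total

# Hypothetical trace: the jump from LBA 0 to LBA 100 occurs twice.
trace = [(0, 8), (100, 8), (108, 8), (0, 8), (100, 8), (500, 8)]
percent = partial_determinism(trace)   # 2 of 4 non-sequential accesses repeat
```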


3 Overview and Architecture 


BORG is motivated by the simple question: What stor- 
age system optimizations based on workload character- 
istics can allow applications to utilize the disk drive 
more efficiently than current systems do? This sec- 
tion presents the rationale behind the design decisions 
in BORG and its system architecture. 


3.1 BORG Design Decisions 


A Disk-based Cache. 
The operating system uses main memory to cache frequently
and recently accessed file system data to reduce
the number of disk accesses incurred. In any given du- 
ration of time, the effectiveness of the cache is largely 
dependent on the on-disk working-set of the I/O work- 
load, and can degrade when this working-set increases 
beyond the size of the page cache. Storage optimiza- 
tions such as prefetching [16, 24, 33] and I/O schedul- 
ing [13, 26, 27, 32] help improve disk I/O performance 
in such situations. 


Using a disk-based cache as an extension of the main 
memory cache offers three complementary advantages 
in comparison to main memory caching alone, prefetch- 
ing, and I/O scheduling. First, it is more effective as a 
cache (than main memory) because it offers a less expen- 
sive (and thus larger) as well as reliable caching solution, 
thus allowing data to be cache-resident for long periods 
of time. Second, the size of the disk-based cache can 
easily be configured by the system administrator with- 
out changing any hardware. And finally, dynamically 
optimizing data layout based on access patterns within 
a disk-based cache provides the unique ability to make 
originally non-sequential data accesses more sequential. 


A Block Layer Solution. 

A self-optimizing storage solution can be built at any 
layer in the storage stack (shown in Figure 2). Block 
level attributes of disk I/O operations are not easily ob- 
tained at the VFS or the page cache layer. While file 
system layer solutions can benefit from semantic knowledge
of blocks, they incur a significant disadvantage in
being tied to a specific file system (and perhaps even ver- 
sion). Device driver encapsulations (interface at P4) are 
incapable of capturing upper layer attributes, such as pro- 
cess ID and request time-stamp due to I/O scheduler re- 
ordering and loss of process context. 


We contend that the block layer (interface at P3) is 
ideal for introducing block reorganization for several reasons.
First, key temporal, block-level, and process-level attributes
about disk accesses are available. Second, operating
at the block layer makes the solution independent
of the file system layer above, allowing it the flexibility 
to support multiple heterogeneous file systems simulta- 
neously. Finally, new abstractions due to virtualization 
trends (e.g., virtual block device abstraction) as well as 
network-attached storage environments (SAN and NAS) 
can be supported in a straightforward way. In the case 
of SAN, BORG can reside on the client, where all context
for I/O operations is readily available, with the underlying
assumption that the SAN device’s logical block
address space is optimized for sequential access. In the 
case of NAS, the BORG layer can reside within the NAS 
device where I/O context is readily available. Modifying 
the NAS interface to include process associations within 
file I/O requests can complete the profile information. 
Using an Independent BOPT partition.

The file system optimizes for sequential accesses to en- 
tire files, a common form of file access. However, 
certain workloads, including application start-up, con- 
tent indexing and web-page requests, exhibit a more non- 
sequential, but deterministic, access behavior. It is thus 
possible that the same set of data can be accessed sequen- 
tially by some applications and non-sequentially by oth- 
ers. Further, some deterministic non-sequential accesses 
may be only a temporary phenomenon.

Based on this observation, Akyurek and Salem [2] 
have argued in favor of copying rather than shuffling [29, 
39] of data. Copying retains original sequential layouts 
so a choice of location based on the observed access pat- 
tern may be possible. Reverting back to the original lay- 
out is straightforward. Similarly, rather than permanently 
disturbing the sequential layout of files, BORG operates 
on copies of blocks placed temporarily in an independent 
BOPT partition, optimizing for the current common case 
of access for each data block. 


3.2 BORG Architecture 
Figure 2: BORG System Architecture.


Abstractly, BORG follows a four-stage process:
1. profiling application block I/O accesses,
2. analyzing I/O accesses to derive access patterns,
3. planning a modification to the data layout, and
4. executing the plan to reconfigure the data layout.
In addition, an I/O indirection mechanism runs continuously,
re-directing requests to the partition that it optimizes
as required. Figure 2 presents the architecture of
BORG in relation to the storage stack within the operating
system. The modification to the existing storage
stack is in the form of a new layer, which we term the BORG
layer, that implements three major components: the I/O
profiler, the BOPT reconfigurator and the I/O indirector.
A secondary throttle-friendly user-space component implements
the analyzer and the planner stages of BORG
and performs computation and memory-intensive tasks.
While profiling and indirection are both continuous processes,
the other stages run periodically and in succession,
culminating in a reconfiguration operation.

For the I/O profiler, we use a low-overhead kernel tool 
called blktrace [3]. The analyzer reads the I/O trace 
collected by the profiler and derives data access patterns. 
Subsequently, the planner uses these data access patterns 
and generates a new reconfiguration plan for the BOPT 
partition, which it communicates to the BOPT reconfig- 
urator component. The user-space analyzer and planner 
components run as a low-priority process, utilizing only 
otherwise free system resources. Under heavy system 
load, the only impact to BORG is that generating the new 
reconfiguration plan would be delayed. 

The BOPT reconfigurator is responsible for the periodic
reconfiguration of the BOPT partition, per the layout
plan specified by the planner. The reconfigurator issues
low-priority disk I/Os to accomplish its task, minimizing
the interference to foreground disk accesses.
Finally, the I/O indirector continuously directs I/O requests
either to the FS partition or the BOPT partition, based on 
the specifics of the request and the contents of the BOPT. 


3.3 BOPT Space Management
Figure 3: Format of the BOPT partition. Each entry in
the Write-Buffer and Read-Cache map tables is a 3-tuple 
of the form (FS LBA, BOPT LBA, valid bit). 


The BORG OPtimized Target (BOPT) partition as managed
by BORG is shown in Figure 3. To reduce head move- 
ment, we suggest that the BOPT partition be created 
adjoining the swap partition if virtual memory is used. 
BORG partitions the BOPT into three fragments: BORG 
Meta-data, Read-cache, and Write-buffer. The Read- 
cache and Write-buffer are further sub-divided into fixed- 
length segments which store both data and (valid/invalid) 
map entries for the segment. The in-memory indirec- 
tion map (elaborated in § 4.5) maintained by BORG is a 
union of all the segment map entries in the BOPT. The 
BOPT map entries are synchronously updated each time 
the in-memory map information changes. Additionally, 
the segment map in the write-buffer contains a “valid en- 
tries counter” to track space usage in the write buffer. 
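
The relationship between the per-segment map entries and the in-memory indirection map can be sketched in Python as below. This is an illustration, not the BORG kernel data structures: the names `MapEntry`, `Segment`, `build_indirection_map`, and `redirect` are hypothetical, and the redirection rule shown (send a request to the BOPT whenever a valid mapping exists) is a simplifying assumption.

```python
from dataclasses import dataclass, field

@dataclass
class MapEntry:
    fs_lba: int          # block address in the file system partition
    bopt_lba: int        # address of the copy inside the BOPT partition
    valid: bool = True   # valid bit of the 3-tuple

@dataclass
class Segment:
    entries: list = field(default_factory=list)

    def valid_entries(self):
        """Counterpart of the write-buffer's valid entries counter."""
        return sum(1 for entry in self.entries if entry.valid)

def build_indirection_map(segments):
    """Union of all valid segment map entries, keyed by FS LBA."""
    indirection = {}
    for segment in segments:
        for entry in segment.entries:
            if entry.valid:
                indirection[entry.fs_lba] = entry.bopt_lba
    return indirection

def redirect(indirection, fs_lba):
    """Assumed rule: use the BOPT copy if one is mapped, else the FS block."""
    if fs_lba in indirection:
        return ("BOPT", indirection[fs_lba])
    return ("FS", fs_lba)

# Hypothetical contents of one read-cache and one write-buffer segment.
read_cache = Segment([MapEntry(1000, 10), MapEntry(2000, 11, valid=False)])
write_buffer = Segment([MapEntry(3000, 500)])
imap = build_indirection_map([read_cache, write_buffer])
```

Rebuilding the dictionary from the segments mirrors how the in-memory map could be populated at initialization from the persistent per-segment entries.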
Field                Description
[...]                BORG BOPT partition identifier.
BORG_REQUIRE         BOPT contains dirty data.
BOPT size            BOPT partition size.
Read-cache info      Offset and size of the Read-cache.
Write-buffer info    Offset and size of the Write-buffer.
[...]                Fixed size of segments in the BOPT.


Table 2: BORG meta-data.


Table 2 depicts the BOPT meta-data fragment. It
stores key persistent information that aids in the operation
of BORG. The BORG_REQUIRE bit is set when the
BOPT contains data that must be copied back to the
FS. If set, the operating system initiates BORG at boot
time to ensure consistent data accesses. The remaining 
meta-data information is used to correctly populate the 
in-memory indirection map structure during BORG ini- 
tialization. 


4 Detailed Design 


In this section, we present the design details of BORG 
by elaborating on its individual components. 


4.1 I/O Profiler 


The I/O profiler is a data collection component that is
responsible for comprehensively capturing all disk I/O
activity. The I/O profiler generates an I/O trace that includes
the temporal (timestamp of the request), process
(process ID and executable) and block-level (address
range and read/write mode) attributes. We use the Q
events reported by blktrace [3], which capture the I/O
requests queued at the block layer. These include all
requests as issued by the file system(s), including any
journaling and/or page destaging mechanisms. We defer
further details to the blktrace work [3].





4.2 Analyzer 


The analyzer is responsible for summarizing the disk I/O
workload. It first splits the I/O trace obtained from the
profiler into multiple I/O traces, one per process. Each
process trace is used to build a directed process access
graph $G_i(V_i, E_i)$, where vertices represent LBA ranges
and edges a temporal dependency (correlation) between
two LBA ranges. The weight on an edge between vertices
$(u, v)$ represents the frequency of accesses (reads
or writes) from $u$ to $v$. The directed and weighted graph
representation is powerful enough to identify repeated
sequences of multiple non-sequential requests.

Since multiple processes may access the same LBA,
a single master access graph $G(V, E)$ that captures all
available correlations in a single input for the reconfiguration
planner is created (illustrated in Figure 4). The
complexity of the merge process increases if two ver- 
tices (either within the same graph or across graphs) have 
overlapping ranges. This is resolved by creating multi- 
ple vertices so that each LBA is represented in at most 
one range vertex. While we omit the detailed algorithm 
Figure 4: Building the master access graph. Vertices
are defined by (start LBA, size of request). Since vertices
$r_1$ and $s_1$ have overlapping LBAs, $r_1$ is split into two
vertices in the master access graph, one with size 1 and
the other with the overlapping $s_1$ blocks, starting at LBA
1 with size 2.


for vertex splitting and graph merging due to space constraints,
we point out that we reduce the complexity of
the merge algorithm by keeping the vertices sorted by
their initial LBA. The total time complexity for the analyzer
stage is given by $O(n \times l)$, where $n$ is the number
of vertices and $l$ is the size (in LBA) of the largest vertex
in the graph. Once the merge operation is completed, the
master access graph, $G$, is obtained.
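
Since the detailed splitting and merging algorithm is omitted, the following Python sketch shows only one possible way to obtain a master graph in which each LBA belongs to at most one range vertex. It is an assumption-laden illustration, not the BORG analyzer: it splits every request at all range boundaries seen in any process trace before building the per-process graphs, and then merges the graphs by summing edge weights. All function names and the sample traces are hypothetical.

```python
from collections import Counter

def boundaries(traces):
    """All range start and end LBAs seen in any process trace, sorted."""
    points = set()
    for requests in traces.values():
        for start, size in requests:
            points.add(start)
            points.add(start + size)
    return sorted(points)

def split_request(request, points):
    """Split one (start LBA, size) range at the global boundaries."""
    start, size = request
    cuts = [p for p in points if start <= p <= start + size]
    return [(a, b - a) for a, b in zip(cuts, cuts[1:])]

def process_graph(requests, points):
    """Edge (u, v) -> number of times v was accessed right after u."""
    pieces = [piece for req in requests for piece in split_request(req, points)]
    return Counter(zip(pieces, pieces[1:]))

def master_graph(traces):
    """Merge the per-process graphs by summing edge weights."""
    points = boundaries(traces)
    master = Counter()
    for requests in traces.values():
        master.update(process_graph(requests, points))
    return master

# Two hypothetical processes whose first requests overlap on LBAs 1 and 2.
traces = {"r": [(0, 3), (10, 2)], "s": [(1, 2), (10, 2)]}
graph = master_graph(traces)
```

In this example the range (0, 3) is split into (0, 1) and (1, 2), so the edge from (1, 2) to (10, 2) accumulates weight from both processes.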


4.3 Planner 


The planner takes the master access graph, G, as input
and determines a reconfiguration plan for the BOPT par-
tition. It uses a greedy heuristic that starts by choosing
for placement the most connected vertex, u, i.e., the one with
the maximum sum of incoming and outgoing edge weights (Fig-
ure 5). Next it chooses the vertex v most connected (in
one direction only, either incoming or outgoing) to u. If
v lies on an outgoing edge of u, it is placed after u, and
if it lies on an incoming edge, it is placed before u. The
next vertex to be placed is the one most connected to the
group u ∪ v. This process is repeated until either all the
vertices in G are placed, or the read cache in the BOPT is
fully occupied, or the edges connecting to the unplaced
vertices in the master graph have weight below a cho-
sen threshold. If the graph contains disconnected com-
ponents, each of these is placed as a separate group. The
time complexity for the planner is O(n × lg(m) + n²),
where n is the number of vertices and m is the num-
ber of edges; finding the most connected vertex takes
O(n × lg(m)) time and finding each subsequent vertex takes
O(n) time.
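
The greedy placement described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the BORG implementation: the name plan_layout, the edge dictionary format, and the max_vertices stand-in for read-cache capacity are hypothetical, connectivity to the placed group is taken to be the summed edge weight in one direction, and only a single connected group is placed. The sketch also makes no attempt to match the complexity bound given in the text.

```python
from collections import deque


def plan_layout(edges, threshold=1, max_vertices=None):
    """Greedy layout of one connected group.

    edges maps a directed edge (u, v) to its weight. Returns the vertices
    ordered from the lowest to the highest LBA of the group.
    """
    vertices = {x for edge in edges for x in edge}
    if not vertices:
        return []
    # Connectivity of a vertex: sum of incoming and outgoing edge weights.
    degree = dict.fromkeys(vertices, 0)
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
    first = max(sorted(vertices), key=degree.get)
    layout, placed = deque([first]), {first}
    while placed != vertices and (max_vertices is None or len(placed) < max_vertices):
        best = None  # (weight, vertex, place_after_group)
        for x in sorted(vertices - placed):
            out_w = sum(w for (a, b), w in edges.items() if a in placed and b == x)
            in_w = sum(w for (a, b), w in edges.items() if a == x and b in placed)
            cand = (out_w, x, True) if out_w >= in_w else (in_w, x, False)
            if best is None or cand[0] > best[0]:
                best = cand
        if best[0] < threshold:
            break  # remaining vertices are too weakly connected to place
        if best[2]:
            layout.append(best[1])      # reached via an outgoing edge: place after
        else:
            layout.appendleft(best[1])  # reaches the group via an incoming edge: place before
        placed.add(best[1])
    return list(layout)
```

With edges C→B (weight 5), G→C (weight 3), and B→E (weight 2), the sketch places C first, then B after it, then G before the group, and finally E at the end.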


4.4 BOPT Reconfigurator 


The BOPT reconfigurator implements the plan created 
by the planner component by performing the actual data 
movement to realize the new configuration of the BOPT. 
This task is complicated primarily because of consis- 
tency and overhead concerns. Overhead is partially ad- 
dressed by issuing low-priority I/O requests for data lay- 
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Figure 5: Placing the master access graph. C is 
the most connected vertex and is chosen for placement
first. Next, vertex B is placed after vertex C since it
is connected by an outgoing edge and has the highest
weight of all the edges connected to C. Next, vertex
G is placed before vertex group C ∪ B. The final se-
quence of vertices from the lowest LBA to the highest is:
L = (F, H, J, A, G, C, B, E, D).


out reconfiguration, making the use of a priority sched- 
uler a prerequisite. BORG ensures block data consis- 
tency between the FS and BOPT copies of data blocks 
by maintaining a persistent indirection map, termed the 
borg_map, that continuously tracks the most up-to-date
location of a data block. This map is updated each time 
a block location changes. 

The reconfigurator copies blocks in three stages: out-
going, where it copies all the dirty blocks that are no
longer in the new plan back to their original file system
(FS) locations; relocate, where it copies blocks that have
to be relocated within the BOPT; and incoming, where
it copies all the new blocks that have to be copied from
the FS to the BOPT. A single data movement operation
and the corresponding update of the borg_map entry can be
considered ‘atomic’ since any application write request 
to the source LBA during data movement is put on hold 
until after the movement is complete and the borg_map 
entry is updated. This ensures that an up-to-date version 
of data is always maintained by the file system. 


4.5 I/O Indirector


The I/O indirector operates continuously, redirecting file
system I/O requests as required. An I/O request may be
composed of an arbitrary number of pages. Each page
request is handled separately based on (i) the number of
blocks that can be satisfied from the BOPT as per the
borg_map entry, (ii) the type of operation (read or write),
and (iii) the presence of a free page in the BOPT.

For each I/O request larger than one page, the indirec- 
tor splits it into multiple per-page requests. If a map- 
ping exists for all the pages of the I/O request in the 
borg_map, the request is indirected to the BOPT. If no
mapping exists, and the request is a read request, it is is- 
sued unchanged to the file system. If only some pages of 
a read I/O request are mapped and the mapped entries are 
clean, the entire I/O is indirected to the file system; this 
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optimization reduces the seek overhead incurred to serve 
the request partially from the BOPT and the rest from the 
FS. For a write request, when no mapping exists for any 
of the pages, the blocks are written to a write-buffer por- 
tion of the BOPT reserved for assimilating write requests 
(if space permits) along with an additional request for up- 
dating corresponding mapping entries in the borg_map. 
For partially-mapped writes, the mapped blocks are in- 
directed to their BOPT locations; the unmapped pages 
are also absorbed in the write-buffer, space permitting, 
otherwise these are issued to the FS. 
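
The routing rules above can be summarized in a short sketch. The function name route_pages, the dictionary-based stand-in for the borg_map, and the target labels are assumptions made for illustration. The text does not spell out the case of a partially mapped read with dirty mapped pages; the sketch assumes per-page routing there and marks that assumption in a comment. Updating map entries for pages absorbed by the write buffer is not shown.

```python
def route_pages(op, pages, borg_map, free_wbuf_pages):
    """Decide where each page of one I/O request is sent.

    op              -- "read" or "write"
    pages           -- page numbers (FS address space) touched by the request
    borg_map        -- dict: mapped page -> True if its BOPT copy is dirty
    free_wbuf_pages -- free pages left in the BOPT write buffer
    Returns a dict page -> "BOPT", "WBUF" (write buffer) or "FS".
    """
    mapped = [p for p in pages if p in borg_map]
    if op == "read":
        if len(mapped) == len(pages):
            return {p: "BOPT" for p in pages}   # fully mapped read
        if not any(borg_map[p] for p in mapped):
            # Unmapped, or partially mapped with clean entries:
            # serve the whole request from the FS to avoid extra seeks.
            return {p: "FS" for p in pages}
        # Assumption: with dirty mapped pages the request is split per page.
        return {p: "BOPT" if p in borg_map else "FS" for p in pages}
    routes = {}
    for p in pages:
        if p in borg_map:
            routes[p] = "BOPT"                  # written in place in the BOPT
        elif free_wbuf_pages > 0:
            routes[p] = "WBUF"                  # absorbed by the write buffer
            free_wbuf_pages -= 1
        else:
            routes[p] = "FS"                    # no space left in the write buffer
    return routes
```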


4.6 Kernel Data Structures 


The persistent data structure borg_map is implemented 
as a radix tree such that given an FS LBA, the BOPT 
LBA can be retrieved efficiently and vice-versa. It also 
maintains the dirty information for the BOPT LBAs. For 
every page of 4KB, BORG stores 4 bytes each for the for- 
ward and the reverse mapping and one dirty bit. If all the 
pages in the BOPT of size S GB are occupied, the worst 
case memory requirement is 2 × S MB (S MB for for-
ward and reverse mapping each), and S/32 MB for the dirty
information. Thus, in the worst case, borg_map requires 
memory of 0.25% of the size of the BOPT partition, an 
acceptable requirement for kernel-space memory. 
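
The worst-case figure can be checked with a few lines of arithmetic. The sketch below assumes binary units (1 GB = 2^30 bytes) and uses hypothetical names; it only restates the per-page costs given above (4 bytes forward, 4 bytes reverse, one dirty bit per 4KB page).

```python
PAGE_BYTES = 4096   # BORG tracks metadata per 4KB page
ENTRY_BYTES = 4     # bytes per forward or reverse mapping entry


def borg_map_worst_case_bytes(bopt_gb):
    """Worst-case borg_map memory when every BOPT page is occupied."""
    pages = bopt_gb * 2**30 // PAGE_BYTES
    forward = pages * ENTRY_BYTES   # S MB for a BOPT of S GB
    reverse = pages * ENTRY_BYTES   # another S MB
    dirty = pages // 8              # one bit per page, i.e. S/32 MB
    return forward + reverse + dirty


total = borg_map_worst_case_bytes(8)   # 8GB BOPT
fraction = total / (8 * 2**30)         # share of the BOPT size
```

For an 8GB BOPT this gives 16.25 MB, which is just under 0.2% of the BOPT size and therefore within the 0.25% bound quoted above.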


5 Implementation Issues 


In this section, we discuss the particularly challenging 
aspects of the BORG implementation that help address 
data consistency and overhead. 


5.1 Persistent Indirection Map 


Since BORG replicates popular data in the BOPT space, 
the system must ensure that reads are always up-to- 
date versions of data, including after a clean shutdown 
or a system crash. BORG implements a persistent 
borg_map, which is distributed within read-cache and 
write-buffer segments of the BOPT. Map entries on-disk 
are updated (along with their in-memory version) each 
time the BOPT partition is reconfigured or when a new 
map entry is added to accommodate a new write ab- 
sorbed into the BOPT. Upon writes to an existing BOPT 
mapped block, its indirection entry in the in-memory 
copy of the reconfiguration map is marked as dirty, once 
the I/O is completed. To minimize overhead for BOPT 
writes, we chose not to maintain dirty information in the 
on-disk copy. Upon reboot after an unclean shut down, 
all entries in the persistent map are marked as dirty and 
future I/Os to these blocks are directed to the BOPT.


5.2 Optimizing Reconfiguration 

Consider a set L of n LBAs, L_1, ..., L_n, sequentially
located in the BOPT space. L forms a chain if ∀ L_i ∈ L,
where L_i ≠ L_n, L_i has to be relocated to location L_{i+1}
and L_n is an outgoing block. If L_n has to be relocated to
L_1 within the BOPT, L forms a cycle. Information about
chains and cycles, that occur exclusively for the relocated
blocks, can be used to further optimize data movement 
during the reconfiguration operation. If a cycle exists, it
is broken by copying the last block L_n back to the FS
(if dirty) and then deleting the plan entry for that block;
an additional plan entry is then created to mark this as an
incoming block to L_1. Next, all remaining blocks be-
longing to the same chain/cycle are copied to their new
locations in the BOPT. To do so, the reconfigurator is- 
sues all reads to the source locations in parallel; once all 
reads have been completed, it issues all the writes in par- 
allel, in each case allowing the I/O scheduler to optimize 
the request schedule. 
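
The cycle-breaking and batched copy steps can be illustrated with a small in-memory simulation. This is a sketch under assumptions, not the kernel implementation: the function name relocate, the slot dictionaries, and the tuple layout are hypothetical, the FS dictionary is assumed to hold an entry for every block, and the parallel reads and writes are modeled as two sequential passes.

```python
def relocate(bopt, fs, moves):
    """Apply one chain or cycle of relocations inside the BOPT.

    bopt  -- dict: BOPT slot -> (fs_lba, data, dirty)
    fs    -- dict: fs_lba -> data at the original FS location
    moves -- list of (src_slot, dst_slot) in chain order; a cycle ends
             with a move from the last slot back to the first one
    """
    moves = list(moves)
    incoming = None
    if len(moves) > 1 and moves[-1][1] == moves[0][0]:
        # Cycle: drop the last move and send that block back to the FS.
        last_slot, first_slot = moves.pop()
        fs_lba, data, dirty = bopt[last_slot]
        if dirty:
            fs[fs_lba] = data
        incoming = (fs_lba, first_slot)   # re-enters later as an incoming block
    staged = [(dst, bopt[src]) for src, dst in moves]   # issue all reads first
    for dst, block in staged:                           # then issue all writes
        bopt[dst] = block
    if incoming is not None:
        fs_lba, slot = incoming
        bopt[slot] = (fs_lba, fs[fs_lba], False)        # clean copy from the FS
    elif moves:
        bopt.pop(moves[0][0], None)   # assumption: head slot of a chain is vacated
    return bopt
```

Reading every source before writing any destination means no block is overwritten before it has been staged, which is what lets the I/O scheduler reorder each batch freely.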


5.3 Other Data Consistency Issues 


BORG maintains metadata at the granularity of a page 
(rather than block) to reduce metadata memory require- 
ment (by 8X for Linux file systems). Consequently, the 
indirector must carefully handle I/O requests whose sizes 
are not multiples of the page-size and/or which are not 
page-aligned to the beginning of the target partition. We 
address this issue via I/O request splitting and page-wise 
indirection, techniques borrowed from our earlier work 
on EXCES [38], a block-layer extension that manages a 
persistent cache for reducing disk power consumption. 

BORG is dynamically included in the I/O stack by 
substituting the make_request function of the device 
targeted for performance optimization. While module in- 
sertion is straightforward, module removal/unload must 
ensure that all the data from the BOPT has been copied 
back to their original locations in the file system and han- 
dle foreground I/O correctly. Once again, BORG uses 
techniques from EXCES [38] and flushes dirty BOPT 
blocks to their original locations in the file system upon 
removal. To address race conditions caused when an ap- 
plication issues an I/O request to a page that is being 
flushed to disk, BORG stalls (via sleep) the foreground 
I/O operation until the specific page(s) being flushed are 
written to the disk. 


6 Evaluation 


In this section, we compare the performance of BORG 
with a vanilla system in which all the blocks are located 
in their original FS space under various workloads to an- 
swer the following questions. 

(i) How well does BORG perform? We use the total 
disk busy time (i.e., excluding all idle periods) as the pri- 
mary metric of performance. Due to BORG’s optimiza- 
tions, apart from the potentially improved head position- 
ing times, the degree of merging of requests may also be 
increased when compared with the vanilla configuration, 
thus changing the request pattern itself. Thus, the more 
common I/O response time metric is an ill-suited choice.
The total disk busy time (henceforth simply referred to 
as disk busy time) is also robust against the trace-replay 
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Table 3: Experimental test-bed details. 


speedups we employ in some of our experiments. 

(ii) Why is BORG effective? We would like to know if 
BORG performance gains are because of the sequential- 
ity or the proximity of data (or both) in the BOPT. We use 
two metrics, average seek distance and non-sequential 
accesses percentage for this purpose. The latter is mea-
sured as the ratio of non-sequential accesses to total accesses. Since non-sequential accesses are at
least an order of magnitude less efficient than sequential 
accesses, even a small reduction in this metric may lead 
to substantial performance benefit. 
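
As a rough illustration of how these two metrics could be derived from a block trace, consider the sketch below. Several details are assumptions rather than the paper's definitions: an access counts as sequential when it starts exactly where the previous request ended, the first access is not counted as non-sequential, and seek distance is measured in LBAs between consecutive requests (converting to cylinders would need the drive geometry). The name access_metrics is hypothetical.

```python
def access_metrics(trace):
    """Return (non-sequential access %, average seek distance in LBAs).

    trace is a list of (start_lba, size) requests in issue order.
    """
    if len(trace) < 2:
        return 0.0, 0.0
    non_sequential = 0
    total_seek = 0
    prev_end = trace[0][0] + trace[0][1]
    for start, size in trace[1:]:
        if start != prev_end:
            non_sequential += 1            # head has to move before this request
        total_seek += abs(start - prev_end)
        prev_end = start + size
    pct = 100.0 * non_sequential / len(trace)
    avg_seek = total_seek / (len(trace) - 1)
    return pct, avg_seek
```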

(iii) When is BORG not effective? BORG can degrade 
the system performance for certain workloads. We eval- 
uate BORG for varying workloads to determine in which 
cases it could perform worse than the vanilla system. 
(iv) How much CPU resource overhead does BORG in- 
cur? While the upper bound on memory overhead was 
examined in § 4.6, the CPU resources consumed by 
BORG should also be within acceptable limits. We use 
the execution times for various stages of BORG as an ap- 
proximate measure of CPU resource utilization. 

(v) How is BORG affected by its parameters? We per- 
form a sensitivity analysis of BORG to its parameters 
- reconfiguration interval, BOPT size, and BOPT write 
buffer fraction - to evaluate their impact on performance. 
Experimental Setup. All experiments were performed 
on machines running the Linux 2.6.22 kernels. We 
used host machines, O1 through O5, with differing hard-
ware configurations and disk drives (Table 3). We used
reiserfs for O1 and O3, and ext3 for the rest. No
additional hardware was required to implement BORG. 

We conducted four different sets of experiments. The 
first set uses week-long traces of a developer’s system 
and a Subversion control server (SVN). The second ex- 
periment is an actual deployment of a web server that 
mirrors our CS department’s web server. The third ex- 
periment evaluates BORG performance in a virtual ma- 
chine environment. The fourth experiment evaluates the 
performance improvement due to BORG for application 
start-up events. 

In each experiment, we performed 4 reconfigurations 
equally spaced in time; this gave us a reasonable number 
of phases for detailed analysis. By not choosing more 
favorable times such as idle disk periods based on well- 
known diurnal workload cycles, we would only over- 
estimate the overhead of BORG during reconfiguration. 
We further discuss the selection of this parameter in § 6.5 
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Figure 6: Disk busy times in various phases of the SVN 
server trace replay. R_i and N_j correspond to recon-
figuration phase i and non-reconfiguration phase j re-
spectively. R3 and R4 are beyond the y-axis range with
values of 272 and 564 seconds respectively.


and § 7. Finally, we use the notation R_i and N_j in var-
ious graphs to denote reconfiguration phase i and non-
reconfiguration phase j respectively.


6.1 Trace Replay 


To evaluate BORG under realistic workloads, we con- 
ducted trace replay experiments using SVN server and 
developer workloads described in Table 1. For the traces 
and the replay, we used blktrace and btreplay respec- 
tively [3]. We used an acceleration factor of 168X that re- 
duces the experimentation time from one week to a man- 
ageable one hour after verifying that the resultant block 
access sequence was unaffected. The trace-playback 
acceleration factor was reverted to 1X during each re- 
configuration operation to accurately estimate reconfig- 
uration overhead. Since we only measure disk busy 
times, the comparison between normal and reconfigura-
tion phases remains valid despite the varying accelera-
tion factors.


6.1.1 SVN Server 


For the SVN server trace replay, we used host O2 (Ta-
ble 3). The write buffer size was set to 20% of the BOPT
size. Figure 6 shows the disk busy times during differ- 
ent phases of the experiment. In all the reconfiguration 
phases the busy time with BORG is notably higher than 
the vanilla case. This is due to substantial head move- 
ment during reconfiguration for relocating blocks. The 
longest reconfiguration phase lasted approximately 10 
minutes. R3 and R4 have substantially higher busy times
than the previous two reconfigurations. After trace anal- 
ysis, we found that while the amount of data movement 
was similar across the four reconfiguration instances, in 
the latter two phases, the I/O scheduler merge ratio and 
the sequential disk accesses dropped dramatically; this 
can be attributed to the blocks relocated within the BOPT 
being spread out more than in the previous reconfigura-
tions. However, as is evident from the vanilla busy times,
the foreground activity during these intervals is negligi-
ble, and thus the increased reconfiguration durations have
little impact on foreground I/O.

Figure 7: Disk busy time in various phases of the de-
veloper trace replay.

In all the non-reconfiguration phases, each of which 
lasted 1.75 days approximately, BORG offers better per- 
formance for foreground I/O than the vanilla configura- 
tion. In the best case (range N2), BORG decreases the 
disk busy time by approximately 45%. This is a sur- 
prising result, since as per Figure 1(c), the working-set 
for this workload undergoes rapid shifts. The expla- 
nation lies in the fact that the SVN server is a write- 
intensive workload and the BOPT write-buffer is suc- 
cessful in sequentializing a rapidly changing, possibly 
non-sequential, write workload. Analysis of the block 
level traces revealed that with BORG, the non-sequential 
access percentage reduced from 1.70% to 1.15%, and the 
average seek distance reduced from 704 to 201 cylinders 
during the non-reconfiguration phases. 


6.1.2 Developer 


For the developer trace replay, we used host O1 (Ta-
ble 3) with the BOPT write buffer set to 40% of the
BOPT size. Figure 7 shows the disk busy time for this
experiment in various phases. With this workload, the
longest measured reconfiguration phases were R3 and
R4, which lasted approximately 7 minutes each. We ob-
serve reduced disk busy times (13% to 50% reductions)
across the non-reconfiguration periods, except for N5,
which shows an increase of 25%. Overall, the developer
workload is a write-mostly workload and thus, largely 
conducive to BORG optimizations. Analysis of the block 
level traces revealed that overall, the non-sequential ac- 
cess percentage reduced from 3.93% to 3.30%, and the 
average seek distance reduced from 1203 to 782 cylin- 
ders when using BORG. 


6.2 Web Server 


To evaluate BORG in a production server environment, 
we made a copy of our Computer Science depart-
ment web server on the O4 machine (see Table 3), and
replayed all the web requests for a week. During this 
week a total of 1137234 requests to 256017 distinct files 
were serviced. We set BORG to reconfigure four times
during this period, using a BOPT of 8GB (< 5% of the
180GB web server file system). To measure the influence 
of the I/O history, we conducted two sets of experiments. 
In the first experiment, we used all the traces gathered 
from the beginning of the experiment as input to the re- 
configurator (cumulative). For the second, we only used 
the portion of the trace corresponding to the period since 
the last reconfiguration (partial). 
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Figure 8: Disk busy time for the week long web log 
replay. Borg-C and Borg-P correspond to using cumu- 
lative and partial traces respectively. 


Figure 8 shows the improvements in disk busy time 
across various non-reconfiguration and reconfiguration 
phases during the experiment. For both the cumula- 
tive and partial experiments, BORG reduces disk busy 
time in all non-reconfiguration phases with reductions 
ranging from 14% to 35% for cumulative and 5% to
39% for the partial configuration, except N5 for cumu-
lative, which reported a 6% increase due
to a drastic change in the last interval’s workload. Disk
busy times in reconfiguration phases are typically higher 
due to the overhead of copying data to the BOPT. Nev- 
ertheless, BORG was able to obtain overall reductions of 
14% and 18% for cumulative and partial configuration. It 
is interesting to note that short term training yielded bet- 
ter results in this case, perhaps due to greater influence 
of short term locality. 
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Figure 9: BORG overhead. Bars C and P represent the 
cumulative and partial traces experiments respectively. R_i in-
dicates the ith reconfiguration.


Next we examine operational overhead of BORG. Fig- 
ure 9 shows the amount of time spent in each phase of 
the reconfiguration. With cumulative traces, the time 
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required for the analyzer and planner phases increases 
linearly. While the planner and analyzer stages can run 
as low-priority tasks in the background, we must point 
out that the current implementations of the BORG analyzer
and planner stages are highly unoptimized and there is
substantial room for improvement. We discuss possi- 
ble improvements for both subsystems in §7. With par- 
tial traces, the time increases until the second recon- 
figuration, but then decreases and stays almost constant 
for the following ones, indicating a gradually stabilizing 
working-set. 
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Figure 10: Differences in the reconfiguration plans. 


To explain this further, we examined the reconfigu- 
ration plan divided by the type of operation (refer to 
§ 4.4), presented in Figure 10. We note that the size 
of the plan consistently increases when using cumulative 
traces and most of the movements correspond to page re- 
locates, which are page movements within the BOPT it- 
self. The story is quite different for partial traces, where 
we see pages not accessed in the past interval leaving the 
BOPT, resulting in a smaller working set in the BOPT 
and thereby reducing the amount of work done by the 
analyzer, planner, and reconfigurator stages. 


6.3 Virtual Machines 


BORG has the potential to significantly improve the per- 
formance of virtualized environments, by co-locating 
multiple virtual machine (VM) localities spread across a 
physical volume. We evaluated the impact on the per- 
VM boot time and the overall performance of virtual 
machines by deploying BORG in a Xen [4] virtual ma- 
chine monitor. We created four VMs, each with 64MB 
memory and a 4GB physical partition on host O5 (refer
to Table 3). For evaluating boot-time improvement, we 
trained BORG with the boot-time events of all the vir- 
tual machines. BORG showed an almost 3X average im- 
provement in VM boot-times - 167 seconds with vanilla 
and 65 seconds with BORG. 

To measure normal execution performance improve- 
ment for the VMs, we ran the Postmark benchmark 
which emulates an e-mail server and creates and up- 
dates small files. We set the number of files to be 2000 
in 500 directories and performed 200,000 transactions. 
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Figure 11: BORG with a VMM. 


Columns: App | Rand. I/O % | Avg seek (#cyl)
Applications: firefox, oowriter, xemacs, acroread, eclipse, gimp, ooimpress

Table 4: Application start-up time improvement. V: 
vanilla, B: BORG. 


We reconfigured BORG after every 20% of the bench- 
mark was executed with the training set including I/O 
operations from the start of the execution of the bench- 
mark. The results for the I/O performance are shown in 
Figure 11. As before, the reconfiguration phases see
increased disk busy times with BORG. For normal
operation, as the training set increases, the disk busy
times with BORG start decreasing. Overall, there is
an average decrease of 6% in busy time during the non- 
reconfiguration phases. However, this improvement is 
not consistent; performance degrades substantially even 
during normal operation in the early stages of the bench-
mark. The loss of process context inside the VMM is a
key problem that tends to convert sequentially laid out 
files into non-sequential upon reconfiguration. We be- 
lieve that making BORG aware of process context inside 
the VMM [14] can substantially improve the BOPT lay- 
out, resulting in much greater performance benefit. 


6.4 Application Start-up 


We evaluated the impact of BORG on the I/O-bound start-
up phase of common desktop applications using host
O3. We first trained the system for a duration of approx-
imately four hours, during which we invoked a subset of
the applications listed in Table 4 (but specifically exclud- 
ing gimp and ooimpress) multiple times for perform- 
ing common office tasks. We invalidated the page cache 
periodically to artificially dilate time and simulate sys- 
tem reboots. Table 4 shows the difference in application 
start-up times, the percentage of sequential accesses and 
average seek overhead. For the applications that were 
used in training, it can be observed that there is a no- 
ticeable improvement in the I/O time with BORG - at 
least 43% for oowriter and up to 67% for eclipse.
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Figure 12: A sensitivity analysis of BORG performance to its configurable parameters. 


Further, it is interesting to observe that although the per-
centage of sequential I/Os decreases for oowriter and
acroread with BORG, there is an overall improvement
in I/O performance, possibly due to a reduction in ro-
tational overhead. There is barely any difference in the
performance for the untrained application gimp. However,
although ooimpress was not used in the training, its 
start-up user-time shows an improvement of 62% in the 
average I/O time; this can be attributed to large shared 
libraries also used by the oowriter which was included 
in training. 


6.5 Sensitivity Analysis 


To gain maximum performance improvement with
BORG, its configurable parameters (the reconfiguration
interval, the BOPT size, and the BOPT write buffer frac-
tion) must be carefully tuned for a given workload. To
better understand the effects of these parameters, we re- 
played the developer and the SVN workload traces on 
host O1 varying each of these parameters over a range 
of values. In all the experiments, the trace replay be- 
gins at the same starting point, that is after a base re- 
configuration, which uses the first six hours of the trace 
as the training period. We measure the relative effi- 
ciency of disk I/O using BORG averaged across the non- 
reconfiguration intervals by reporting the improvement 
in disk busy time throughput (referred to henceforth as 
“throughput improvement”) when compared to a vanilla
system. 


6.5.1 Reconfiguration Interval 


Figure 12 (left) shows the percentage improvement over 
the vanilla system. The reconfiguration interval is varied 
from 8 hours (18 reconfigurations) to 3 days (1 reconfig- 
uration). To bootstrap the sensitivity analysis, the BOPT 
size is fixed to 1GB, with 50% reserved for write buffer-
ing in this experiment. For the developer workload,
the throughput increases as the reconfiguration interval
increases, because the training set becomes larger and BORG can
more effectively capture the working-set. For the SVN
workload, the performance decreases for higher inter- 
vals. This is because the SVN working-set changes quite 
frequently (see § 2 and Figure 1(c)).


6.5.2 BOPT size 


We use the best-case reconfiguration intervals of 3 days 
for the developer and a day for the SVN workload from 
the previous experiment. We vary the BOPT size from 
256MB to 8GB, of which the write buffer is always cho- 
sen as 50% of the BOPT size. Figure 12 (middle) shows 
that as the BOPT size increases, BORG’s performance 
with the developer workload increases since the devel- 
oper workload has a larger working set. When most 
of the blocks in the working set can be accommodated 
in the BOPT, the performance improvement stabilizes. 
Since the working set size for the SVN workload is rel-
atively smaller, the performance improvement is almost
the same for BOPT sizes above 256MB.


6.5.3 Write Buffer Variation


From our previous results, we pick an interval of 3 days 
and 1 day and BOPT size of 2GB and 4GB for the devel- 
oper and the SVN workloads respectively. We vary the 
write buffer from 0-100%. Figure 12 (right) shows that 
for the developer workload, not having a write buffer re- 
sults in the lowest throughput. There is a steady increase 
in performance, peaking at 50% write buffer. Thereafter, 
it starts falling since read performance begins to degrade 
due to the smaller available read cache. For the write-intensive
SVN workload, the performance increases with increase 
in the write buffer size, since all the writes can be co- 
located in the BOPT partition. 


Configuring BORG parameters. The above experi-
ments indicate that configuring parameters incorrectly
can lead to sub-optimal performance improvements with 
BORG. Fortunately, iterative algorithms can be easily 
employed to identify better parameter combinations in a 
straightforward way. Exploring such iterative algorithms 
more formally is one aspect of our future work. 
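
One simple form such an iterative search could take is a coordinate search that varies one parameter at a time while holding the others fixed, much like the sensitivity experiments above. The sketch below is purely illustrative: the function name tune, the parameter names, and the evaluate callback (which would have to replay a trace and return the measured throughput improvement) are all hypothetical and not part of BORG.

```python
def tune(evaluate, grid, start, rounds=2):
    """Coordinate search over a small parameter grid.

    evaluate -- callable taking a dict of parameters and returning a score
                to maximize (e.g. throughput improvement over vanilla)
    grid     -- dict: parameter name -> list of candidate values
    start    -- dict: initial value for every parameter
    """
    best = dict(start)
    best_score = evaluate(best)
    for _ in range(rounds):
        for name, values in grid.items():
            for value in values:
                trial = {**best, name: value}   # change a single parameter
                score = evaluate(trial)
                if score > best_score:
                    best, best_score = trial, score
    return best, best_score
```

A search of this kind can settle on a local optimum when the parameters interact, which is one reason a more formal treatment is left open.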


7 Discussion 


While our experiences with BORG have been mostly 
positive, there are several directions in which the current 
version can be either improved or extended. We now dis- 
cuss some of the significant directions that can serve as 
subjects of future investigation. 


Analyzer and Planner optimization. The current ver- 
sions of the analyzer (§ 4.2) and the planner (§ 4.3) com- 
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ponents of BORG do not use the results of past execu- 
tions and therefore incur higher overheads for every sub- 
sequent reconfiguration when using cumulative traces for 
training. Each of these components can be substantially 
optimized by making them more intelligent. The ana- 
lyzer can build the master access graph incrementally 
rather than from scratch; likewise, the planner can incre- 
mentally create the new plan for BOPT reconfiguration 
during each iteration. 


Alternate BOPT layout strategies. The current version 
of BORG uses a simple BOPT layout strategy starting 
from the most-connected vertex — the vertex with the 
highest sum of its edge-weights — in the master access 
graph, and then choosing the vertex most connected to 
it, and so on. Alternate layout strategies can be envi- 
sioned that potentially yield greater benefit. For instance, 
the placement can begin with the nodes connected to the
highest weight edge, and then resort to the same incre-
mental addition of vertices. Alternatively, a distributed
layout algorithm can be designed which uses many start- 
ing points for building the layout. 


Timely reconfiguration. The current reconfiguration 
trigger in BORG is based on a fixed interval. However, 
opportune times for performing reconfiguration are dur- 
ing periods of no or low foreground I/O activity, espe- 
cially for workloads that exhibit obvious idle or peak pe- 
riods of activity. More sophisticated triggers can use al- 
ternate metrics to identify “unwanted” or “much needed” 
reconfiguration, such as the BOPT hit rate or the per- 
centage of sequential accesses pre- and post- indirection 
to evaluate the effectiveness of the current BOPT layout. 
The above techniques can help substantially reduce the 
impact of reconfiguration on foreground I/O and increase
the effectiveness of each reconfiguration operation. 


Avoiding performance degradation. BORG can de- 
grade performance for certain workloads, for instance, 
a read-intensive workload that has a very large or unsta- 
ble working-set (§ 6.2). Future versions of BORG can be 
made intelligent to measure the impact of reconfigura- 
tion on such workloads by comparing the percentage se- 
quentiality and the spatial locality for the accesses before 
(vanilla) and after (BORG) the indirection operation. If 
these metrics degrade post-BORG, BORG can be dis- 
abled. Such a mechanism will allow system performance 
to degrade gracefully in the event that the workload is not 
conducive to benefit from block reorganization. 


8 Related Work 

We examine related work by organizing the literature 
into block and file based approaches. 

8.1 Block level approaches 


Early work [41] on optimized data layout argued for 
placing the frequently accessed data in the center of 
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the disk. Vongsathorn et al. [39] and Ruemmler and 
Wilkes [29] both propose Cylinder Shuffling. Ruemm- 
ler and Wilkes specifically demonstrated that perform- 
ing relatively infrequent shuffling led to greater improve- 
ment in I/O performance. In Akyurek and Salem’s 
work [2], the authors demonstrated the advantages of 
copying over shuffling and the importance of reorganiza- 
tion at the block (rather than cylinder) level. These early 
data clustering approaches relied on process- and
access-pattern-agnostic block counts to perform the data
reorganization and reported simulation-based results. 

Researchers have also investigated self-optimizing
RAID systems. Wilkes et al. proposed HP Au- 
toRAID [40], a controller-based solution, that transpar- 
ently adapts to workload changes by using a two-level 
storage hierarchy; the upper level provides data redun- 
dancy for popular data while the lower level provides 
RAID 5 parity protection for inactive data. Work on ea-
ger writing [42] and distorted mirrors [35] addresses mir-
rored/striped RAID configurations, primarily for database
OLTP workloads (which are characterized by little local-
ity or sequentiality), and chooses to write to a free sec-
tor closest to the head position on one or more disk drives.
While we are yet to explore BORG’s use in multi-disk 
systems, the optimizations used in BORG are differ- 
ent and mostly complementary to the above proposals, 
whereby BORG attempts to capture longer-term on-disk 
working-sets within a dedicated volume. 

Hu et al.’s work on Disk Caching Disk [10] uses an ad- 
ditional logging disk (or disk partition) to perform writes 
sequentially and subsequently destage them to their original
locations. Write buffering in BORG is slightly different 
in that writes to data already in the BOPT partition are 
written in place. The DCD work does not optimize for 
data read operations; BORG optimizes reads as well so 
head movement is substantially restricted. 

Among recent work on block reorganization, C- 
Miner [17] uses advanced data mining techniques to 
mine correlations between block I/O requests. These 
techniques can be utilized in BORG to infer complex 
disk access patterns. The Intel Application Launch Ac- 
celerator [12] reorganizes blocks used during application 
start-up to be more sequential, but does not provide a 
generic solution to improve overall disk I/O performance 
of the system. 

For throughput improvement, Schindler et al. have 
proposed free-block scheduling [18] and track-aligned 
extents [31] which use intelligent I/O scheduling rather 
than block reorganization. These are complementary 
techniques that can be used in conjunction with BORG. 

Among block level approaches, our work is closest to 
ALIS [9], wherein frequently accessed blocks as well as 
block sequences are placed sequentially on a dedicated, 
reorganized area on the disk. There are key differences 
in design and implementation, though. First, BORG incurs
reduced space, maintenance, and metadata overhead
since it maintains at most one copy of each data block. 
The multiple replicas in ALIS can become stale quickly 
in write-intensive workloads. Further, unlike BORG, 
ALIS does not optimize write traffic. Finally, the evaluation
of ALIS techniques is performed using a disk simulator with
trace playback. In contrast, we implement and evaluate an
actual system, which lets us address system implementation
issues in greater detail.


8.2 File level approaches 


In one of the early file oriented approaches, Staelin et 
al. [36] proposed monitoring file accesses and mov- 
ing frequently accessed files (entirely) to the center of 
the disk. Log-structured file systems (LFS [28]) offer
superior performance for workloads with a large number
of small writes by batching disk writes to the end of a
disk-sequential log. BORG writes all data to the BOPT
partition to achieve a similar effect, but also attempts to 
co-locate a majority of read operations with the writes. 
Matthews et al. [19] proposed an optimization to LFS by 
incorporating data layout reorganization to improve read 
performance. Their use of block access graphs is similar 
to the process access graphs used in BORG. Their LFS- 
specific solution moves blocks within the LFS partition 
storing exactly one copy of each block at any time. Since 
BORG stores two copies, it can optimize for sequential 
and application-driven deterministic, non-sequential ac- 
cesses simultaneously. 

Researchers have also explored data- and application- 
specific layout mechanisms. Ganger and Kaashoek [6] 
advocate co-locating inodes and file blocks for small 
files. Conversely, PLACE [23] exposes the underlying
layout structure to applications, so they can perform
custom data placement. Sivathanu et al. [34] propose 
semantically-smart disk systems (SDS) that infer file sys- 
tem semantic associations for blocks, subsequently used 
for aligning files with track boundaries. Windows 
XP [21] uses the defragmenter for co-locating temporally 
correlated file data for speeding up application start-up 
events. BORG is a generic solution in comparison to the 
above approaches, since it creates a block reorganization 
mechanism that can adapt to an arbitrary workload. 

Mac OS’s HFS Plus [1] uses adaptive hot file cluster- 
ing to migrate and sequentially store hot files of small 
sizes near the volume’s metadata. In contrast, BORG 
operates at the block layer and sequentializes by copying 
(rather than migrating) hot block sequences, which may 
span either partial or multiple files. 

Among file level approaches, BORG is closest to
FS2 [11]. FS2 proposes replication of frequently accessed
blocks based on disk access patterns in file system
free space. This strategy, unfortunately, also restricts
the degree of seek and rotational-delay optimization due 
to the distribution of free space. Since FS2 may cre- 
ate multiple copies of a block simultaneously, staleness, 
and consequently, space and I/O bandwidth wastage, be- 
come important concerns (similar to those in ALIS); 
BORG maintains at most one extra copy of each block 
and its strength is in being a non-intrusive, storage-stack 
friendly, and file system independent (portable) solution. 


9 Conclusions and Future Work 


We presented BORG, a self-optimizing layer in the stor- 
age stack that automatically reorganizes disk data layout 
to adapt to the workload’s disk access patterns. BORG 
was designed to optimize both read and write traffic 
dynamically by making reads and writes more sequential
and restricting the majority of head movement within
a small optimized disk partition. A Linux implemen- 
tation of BORG was evaluated and shown to offer per- 
formance gains in the average case for varied work- 
loads including office and developer class end-user sys- 
tems, a web server, an SVN server, and a virtual ma- 
chine monitor. Disk busy time reductions with BORG 
across these workloads during non-reconfiguration in- 
tervals range from 6% (for the VM workload) to 50% 
(for the developer server workload), with even greater 
improvements possible with careful parameter selection 
within BORG. 

BORG occasionally performs worse than a vanilla
system, specifically when a read-mostly workload dras- 
tically shifts its working set. BORG is able to eas- 
ily address changing working-sets with a (possibly non- 
sequential) write workload, since it has the ability to ab- 
sorb and sequentialize writes inside the BOPT. A sensi- 
tivity analysis revealed the importance of choosing the 
right configuration parameters for reconfiguration inter- 
val, BOPT size, and the write-buffer fraction. Fortu- 
nately, simple iterative algorithms can be quite effective 
in identifying the right parameter combination; a formal 
investigation of such an approach is an avenue for fu- 
ture work. The memory and CPU overheads incurred by 
BORG are modest, with ample scope for further optimization.
In summary, we believe that BORG offers a
novel and practical approach to building self-optimizing 
storage systems that can offer large I/O performance im- 
provements in commodity environments. 
Abstract


HYDRAstor is a scalable, secondary storage solution 
aimed at the enterprise market. The system consists of 
a back-end architectured as a grid of storage nodes built 
around a distributed hash table; and a front-end consist- 
ing of a layer of access nodes which implement a tradi- 
tional file system interface and can be scaled in number 
for increased performance. 

This paper concentrates on the back-end which is, 
to our knowledge, the first commercial implementa- 
tion of a scalable, high-performance content-addressable 
secondary storage delivering global duplicate elimination,
per-block user-selectable failure resiliency, and
self-maintenance, including automatic recovery from failures
with data and network overlay rebuilding.

The back-end programming model is based on an ab- 
straction of a sea of variable-sized, content-addressed, 
immutable, highly-resilient data blocks organized in a 
DAG (directed acyclic graph). This model is exported 
with a low-level API allowing clients to implement new 
access protocols and to add them to the system on-line. 
The API has been validated with an implementation of 
the file system interface. 

The critical factor for meeting the design targets has 
been the selection of proper data organization based on 
redundant chains of data containers. We present this or- 
ganization in detail and describe how it is used to deliver 
required data services. Surprisingly, the most complex 
to deliver turned out to be on-demand data deletion, fol- 
lowed (not surprisingly) by the management of data con- 
sistency and integrity. 


1 Introduction 


The enterprise environment places strenuous demands
on secondary storage systems. With ever increasing
amounts of data produced and fixed backup windows,
there is a clear need for scaling performance and
backup capacity appropriately. Different types of data
have varying importance, which requires different classes
of reliability and availability, and have specific retention
periods. Regulatory requirements (SOX, HIPAA, the Patriot
Act, SEC rule 17a-4(t)) demand security, traceability
and data auditing. Strict data retention and deletion
procedures need to be defined and followed rigorously.
Failure to present retained data on demand can result in 
serious business losses, fines and even criminal prosecu- 
tion. Last but not least, limited IT budgets increase the 
importance of providing efficient storage by improving 
storage utilization for backup and archival applications, 
and by reducing the data management costs. 


Substantial progress has been made to address these 
enterprise needs, as demonstrated by advanced disk- 
targeted deduplicating Virtual Tape Libraries [2, 3], disk- 
based back-end servers [43] and content-addressable 
archiving solutions [4]. However, the exponential in- 
crease in the amount of data stored creates new prob- 
lems not addressed by these solutions. First of all, unlike 
primary storage, which is usually networked and under 
common management (e.g. SANs), secondary storage 
consists of a large number of highly-specialized dedi- 
cated components, each of them being a storage island 
requiring customized, elaborate, and often manual
administration and management. As a result, a large
fraction of the total cost of ownership (TCO) can still be
attributed to management of more and more secondary
storage components [1, 17, 21]. Moreover, fixed capacity
assignment to each storage device results in poor capacity
utilization. Duplicate elimination in these islands
of storage is similarly limited in scope, which compounds
the inefficiency. Finally, since each secondary storage
device offers fixed, limited performance, reliability and
availability, the high overall requirements of enterprise
secondary storage in these dimensions can be met only
by implementing complex in-house solutions.


Fortunately, new technology and previous research 
results provide building blocks for a solution addressing
these problems. The content-addressable storage
paradigm [4, 24, 43] enables cheap and safe implemen- 
tation of duplicate elimination. Distributed hash ta- 
bles [12, 20, 25, 27, 31, 42] allow for building scalable, 
failure-resistant systems and extending duplicate elimination
to a global level. Erasure codes can add resiliency
to the stored data with fine-grained control over the tradeoff
between the required resiliency level and the resulting storage overhead.
Hardware and pricing trends are also critical for en- 
abling HYDRAstor. The capacity of SATA drives and 
the performance of new multi-core CPUs increase even 
as the costs of these components fall. Together, these 
trends provide the building blocks needed for systems 
like HYDRAstor at a very reasonable cost. 

Other applicable work includes research on
self-management [13, 33], monitoring [35], and on-line
reconfiguration [30] and upgrade [7]. Although all of
these elements facilitated building HYDRAstor, the task 
proved to be much more complex than we originally en- 
visioned and required a significant amount of original 
research. The effort often felt like trying to design and 
construct a building given just bricks and stones. 

HYDRAstor [23] is a commercial secondary storage
solution for the enterprise addressing the shortcomings
discussed earlier. It consists of a back-end architectured as
a grid of storage nodes delivering scalable capacity and a
front-end consisting of a layer of access nodes scaled for
performance. In this paper, we concentrate on the design
of the back-end grid which supports capacity sharing
between all clients and types of data, for example, backup
images or archival data. This sharing together with
system-wide deduplication allows for highly efficient use
of storage capacity. The system is highly available, as it
supports on-line extensions and upgrades, tolerates mul- 
tiple disk, node and network failures, rebuilds the data 
automatically after failures and informs users about re- 
coverability of the deposited data. The reliability and 
availability of the stored data can be dynamically ad- 
justed by the clients with each write, as the back-end 
supports multiple data resiliency classes. 

This paper makes the following contributions. First, it
presents HYDRAstor as a concrete commercial
implementation of a scalable secondary storage system
addressing today’s enterprise needs. Second, it discusses
in detail the HYDRAstor data organization and how it
is used to implement advanced data services like global
duplicate elimination, on-demand deletion, and data
integrity management. Third, it contains an evaluation of
HYDRAstor that demonstrates the effectiveness of its
implementation.

The remainder of this paper is organized as follows. 
Section 2 describes the system’s functionality including 
the programming interface. Section 3 contains a high- 
level discussion of the back-end design. It establishes 
context for Section 4, which discusses requirements
on data organization and the resulting solution.
Section 5 illustrates how this organization is used to deliver
data services like data rebuilding and distributed
data deletion. Section 6 presents an evaluation of the system.
Related work is discussed in Section 7, whereas
conclusions and future work are given in Section 8.


2 Functionality 


The back-end has been designed as a vast data repository, 
allowing for storing and extracting streams of data with 
high throughput. Internally, it consists of a potentially 
large number of independent nodes presented externally 
as a single system image. The back-end is designed to 
scale up to thousands of dedicated nodes which could 
provide hundreds of petabytes of storage. The primary 
deployment target is the data center. 

From the beginning, the HYDRAstor back-end was in- 
tended to provide a foundation for a commercial product. 
Therefore, one of the design targets has been to support 
not only tailor-made new applications, but also commer- 
cial legacy applications, as long as they use streamed 
data access. To that end, the system does not define
one fixed access protocol; instead, it is flexible enough to
support legacy applications using standard interfaces such as
a file system interface as well as new applications using
highly specialized access methods. New protocols can
be dynamically added to an online system by loading a 
new protocol driver without disrupting any client using 
the existing protocols. 

One of the primary design goals has been to ensure 
continuous operation of the system, limiting or eliminating
the impact of upgrades, extensions and failures. The
distributed architecture enhances system availability by
allowing online software or hardware upgrades in most
cases, eliminating the need for costly downtime. Moreover,
the system is capable of automatic self-recovery in
case of hardware failures (disk, network, power loss),
and even from some software failures. The system
works correctly in the presence of up to a specific
configurable number of fail-stop and intermittent hardware
failures. The system does not handle Byzantine failures,
which have a very low probability of actually occurring
in a real data center; handling them would add significant overhead.
However, the system has several layers of data integrity 
checking to detect data corruption. 

Another important function of the system is to en- 
sure high data reliability, availability and integrity. Each 
block of data is written with a user-selected resiliency 
level which allows the user to choose how many concurrent
disk failures the block can survive. This is achieved
by erasure coding each block into fragments; as shown
in [36], erasure codes increase mean time to failure by
many orders of magnitude over simple replication for
the same amount of space overhead. After a failure, if a 
block remains readable, the system automatically sched- 
ules data rebuilding to bring the resiliency back to the 
level requested by the user. No permanent data loss re- 
mains hidden for long. Global state indicates whether all 
stored blocks are readable, and if so, how many disk and 
node failures must happen before data loss occurs. 

Secondary storage systems have unique characteristics 
which influence the design goals. In contrast to primary 
storage, which often deals with random accesses, these 
systems are dominated by writes of long data streams. 
Given the scale of the system, multiple streams will 
be written concurrently by different clients. Successive 
streams are often similar to previously written streams 
which can contain many duplicate blocks. Since all data 
must be saved during short backup windows, very high 
write throughput is essential. Read throughput is quite 
important for restores, but it is not as critical as write 
throughput in our system since the restores are typically 
much less frequent and involve reading only a portion of 
the stored data. 


2.1 Programming Model 


The back-end programming model is based on an ab- 
straction of a sea of variable-sized, content-addressed, 
immutable, highly-resilient blocks. A block consists of 
data and, optionally, an array of block addresses, point- 
ing to previously written blocks. A block’s address is 
derived from the SHA-1 hash of its content (both data 
and pointers). Blocks are variable-sized to allow for bet- 
ter deduplication; and pointers are exposed to facilitate 
data deletion implemented as garbage collection. The 
back-end exports a low-level block interface used to im- 
plement new and legacy protocols. This interface allows 
for a clean separation of the back-end from the front-end 
which can support a wide range of access protocols. 
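
The content-addressing rule above can be illustrated with a minimal sketch. The source states only that a block's address is derived from the SHA-1 hash of its data and pointers; the `Block` type, the `block_address` function, and the serialization (an 8-byte length prefix followed by the data and the pointers) are assumptions made for illustration and are not the HYDRAstor API.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical immutable block: data plus addresses of earlier blocks."""
    data: bytes
    pointers: Tuple[bytes, ...] = ()


def block_address(block: Block) -> bytes:
    """SHA-1 over data and pointers; the byte layout is an assumption."""
    h = hashlib.sha1()
    # Length prefix keeps the data/pointer boundary unambiguous.
    h.update(len(block.data).to_bytes(8, "big"))
    h.update(block.data)
    for pointer in block.pointers:
        h.update(pointer)
    return h.digest()


leaf = Block(b"chunk of a backup stream")
duplicate = Block(b"chunk of a backup stream")
parent = Block(b"directory entry", (block_address(leaf),))
```

Because the address depends only on content, `leaf` and `duplicate` map to the same address, which is what makes duplicate elimination cheap. A parent can only point to blocks whose addresses are already known, that is, to previously written blocks.
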
Figure 1: Blocks organized in a directed acyclic graph.
Data part of each block is shaded, pointers are not. 


Blocks in the back-end form a DAG (directed acyclic 
graph), as illustrated by Fig. 1. Drivers write trees of 
blocks, but because of deduplication, these trees over- 
lap at deduplicated blocks and form directed graphs. No 
cycle is possible in these structures as long as the hash 
used in the block address is secure. A source vertex in a 
DAG is usually a block of a special type called search- 
able retention root. In addition to the regular data and 
the array of addresses, a retention root contains a user- 
defined search key used to locate the block. This key can 
be arbitrary data. A user retrieves a searchable block by 
providing its search key instead of a cryptic block content 
address. For example, multiple snapshots of the same file 
system can have each root organized as a searchable
retention root with a search key containing the file system name
and a counter incremented with each snapshot. Searchable
blocks do not have user-visible addresses and
cannot be pointed to, so they cannot be used to create cycles
in block structures. However, each searchable block has 
an internal hashkey assigned to it for fast retrieval. Un- 
like regular blocks, the hashkey of a searchable block is 
computed only over the search key portion of the block’s 
data. 
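
A small sketch of how a searchable retention root might be located by its search key. The key format (`name#counter`), the in-memory `roots` table, and the use of SHA-1 for the internal hashkey are assumptions for illustration; the source says only that the hashkey is computed over the search key portion of the block.

```python
import hashlib

# Hypothetical in-memory index: internal hashkey -> (search key, data, pointers).
roots = {}


def search_key(fs_name: str, snapshot_no: int) -> bytes:
    """Assumed key layout: file system name plus a snapshot counter."""
    return f"{fs_name}#{snapshot_no}".encode()


def retention_root_hashkey(key: bytes) -> bytes:
    """Hash only the search key, not the block's data or pointers."""
    return hashlib.sha1(key).digest()


def write_retention_root(key: bytes, data: bytes, pointers=()) -> None:
    roots[retention_root_hashkey(key)] = (key, data, tuple(pointers))


def find_retention_root(key: bytes):
    """Return the stored root for this search key, or None if absent."""
    return roots.get(retention_root_hashkey(key))


write_retention_root(search_key("home", 7), b"snapshot metadata")
```

Since the hashkey ignores the data portion, a client can find the root of snapshot 7 knowing only the file system name and the counter.
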

Fig. 1 shows a set of blocks organized into a DAG with
three source vertices. Two of them are retention roots; the third
source vertex is a regular block, which indicates that this
part of the DAG is still under construction.

The API operations include writing and reading regu- 
lar blocks, writing searchable retention roots, searching 
for a retention root based on its search key; and mark- 
ing a retention root with a specified key to be deleted by 
writing an associated deletion root, as discussed below. 
Cutting the data stream into blocks is beyond this interface
and is the responsibility of the drivers, although we plan
to re-evaluate this decision soon.

When writing a block, a driver assigns it to one of a 
few available resiliency classes. Each class represents 
a different tradeoff between data resiliency and storage 
overhead: from the low resiliency data class where a 
block can survive only a single disk failure but has mini- 
mum storage overhead, up to the critical data class where 
each block can be replicated multiple times on different 
disks and physical nodes. Different resiliency classes are 
achieved by varying the number of original fragments in 
the erasure coding scheme (described later). 

The system does not provide a way to delete a single
block immediately because this block may be referenced
by other blocks. Instead, the API allows marking the
roots of the DAG(s) which should be deleted. To mark
a retention root as dead, a user writes a special block 
called searchable deletion root with the search key iden- 
tical to this retention root’s search key. In Fig. 1, there 
is a deletion root associated with the retention root SP1. 
The deletion algorithm marks for deletion all blocks not 
reachable from the live retention roots; for example, in
Fig. 1 all blocks with dotted lines will be marked. The
block named A will also be deleted because there is no 
retention root pointing to it, whereas the block named F 
will be retained, as it is reachable from the retention root 
SP2 which is still defined as live since it does not have a 
matching deletion root. 
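
The deletion rule can be read as a reachability computation. The sketch below is a single-machine illustration, not the distributed algorithm of the paper; the graph is a hypothetical one loosely modeled on the description of Fig. 1 (the exact edges of the figure are not recoverable from the text), and block names stand in for addresses and search keys.

```python
def blocks_to_delete(pointers, retention_roots, deletion_roots):
    """Return blocks not reachable from any live retention root.

    pointers: dict mapping each block name to the blocks it points to.
    A retention root is live if no deletion root carries its search key.
    """
    live_roots = [r for r in retention_roots if r not in deletion_roots]
    reachable = set()
    stack = list(live_roots)
    while stack:
        block = stack.pop()
        if block in reachable:
            continue
        reachable.add(block)
        stack.extend(pointers.get(block, ()))
    return set(pointers) - reachable


# Hypothetical DAG: SP1 and SP2 are retention roots, A is a regular
# source block, and E is shared by both snapshots through deduplication.
graph = {
    "SP1": ["D", "E"],
    "SP2": ["E", "F"],
    "A": ["D"],
    "D": [], "E": [], "F": [],
}
doomed = blocks_to_delete(graph, {"SP1", "SP2"}, deletion_roots={"SP1"})
```

In this example SP1, A, and D are marked for deletion, while E and F survive because they remain reachable from the live root SP2.
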


During data deletion, there is a short read-only period, 
in which the system identifies blocks to be deleted. Ac- 
tual space reclamation happens in the background during 
regular read-write operation. Before entering a read-only 
phase, all blocks to be retained should be pointed to by live
retention roots. 


3 System Architecture 


HYDRAstor back-end nodes are built of highly reliable 
server-grade components. No customized hardware is 
needed. A detailed description of available hardware
configurations is given in Section 6. The number of storage
nodes determines the total raw capacity of the system as 
well as its maximal level of performance. Front-end ac- 
cess nodes can be added to realize this performance up 
to the limit determined by the current back-end configu- 
ration. 


Software components of the back-end include the stor- 
age server and proxy server, both implemented as Linux 
user space processes, and protocol drivers implemented 
as libraries. 


Storage servers are organized in an overlay network, 
with data blocks assigned to each server based on the block’s
hashkey. The details of the overlay are discussed in Sec- 
tion 3.1. Each storage node hosts one or more storage 
servers. The number of storage servers running on a stor- 
age node depends on its resources. The bigger the node, 
the more servers we run, with each server responsible 
exclusively for a specific number of this node’s disks. 
Putting multiple servers on one physical node is a simple 
solution to the problem of harnessing computing power 
of multicore CPUs. 


Proxy servers run on access nodes and export the same 
block API as the storage servers. A proxy provides ser- 
vices like locating the storage nodes, optimized message 
routing and caching. 


Protocol drivers use the API exported by the back- 
end to implement access protocols. These drivers can be 
loaded at runtime on both storage and proxy servers.
Location of a driver depends on available resources and 
driver resource needs. Usually, resource-hungry drivers 
like the file system driver are loaded on proxy servers. 
3.1 Network Overlay


Since one of our design goals has been scalability, the 
use of distributed hash tables has been a natural choice. 
However, because for a distributed storage system both 
storage utilization and data resiliency are extremely im- 
portant, we have had additional requirements on a DHT: 
assurances about storage utilization and ease of integra- 
tion of the selected overlay network with the data re- 
siliency scheme we have planned to use, i.e. erasure cod- 
ing. Since none of the existing DHTs allowed for that, 
we have decided to use a modified version of the Fixed 
Prefix Network (FPN) [12] distributed hash table. FPN 
makes it possible to maintain very short routing paths for 
a wide range of the number of nodes and guarantees a 
minimal level of storage utilization. 


[Figure 2 diagram: a prefix tree rooted at the empty prefix, with supernode components placed on physical nodes node1 through node6.]

Figure 2: Supernodes and components. 4 supernodes 
spanned over 6 physical nodes. Each supernode has 4 
components, i.e. supernode cardinality is 4. 


In FPN, each overlay node is assigned exactly one
hashkey prefix, which also serves as the identifier of this
virtual node. An FPN node is responsible for hashkeys with a
prefix equal to this node’s identifier. All possible hashkeys
form a hashkey space. The overlay network strives to
keep prefixes disjoint and to cover this space completely;
we also call it the prefix space. The upper part of Fig. 2
shows a prefix tree which has four leaf FPN nodes, di- 
viding the prefix space into four disjoint subspaces. 
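
To make the prefix-space idea concrete, the following sketch checks that a set of prefixes is disjoint and complete, and finds the prefix responsible for a given hashkey. The four two-bit prefixes and all function names are hypothetical; the actual prefixes in Fig. 2 are not recoverable from the text, and FPN routing itself is not modeled here.

```python
import hashlib
from fractions import Fraction


def hashkey_bits(data: bytes, nbits: int = 16) -> str:
    """Leading bits of the SHA-1 hashkey, as a string of '0' and '1'."""
    value = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return format(value, "0160b")[:nbits]


def covers_space(prefixes) -> bool:
    """True if the prefixes are disjoint and together cover all hashkeys."""
    disjoint = not any(
        a != b and b.startswith(a) for a in prefixes for b in prefixes
    )
    # A prefix of length k owns a 2**-k share of the hashkey space.
    complete = sum(Fraction(1, 2 ** len(p)) for p in prefixes) == 1
    return disjoint and complete


def responsible_prefix(prefixes, bits: str) -> str:
    """Return the single prefix that owns the given hashkey bits."""
    matches = [p for p in prefixes if bits.startswith(p)]
    if len(matches) != 1:
        raise ValueError("prefixes must be disjoint and complete")
    return matches[0]


prefixes = ["00", "01", "10", "11"]  # hypothetical four-leaf prefix tree
owner = responsible_prefix(prefixes, hashkey_bits(b"some block"))
```

Prefixes of different lengths, for example `0`, `10`, `110`, `111`, also pass the check, which is how a prefix tree can grow unevenly as nodes are added.
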

To meet our DHT requirements, we have extended the 
original FPN with supernodes. A supernode represents 
one FPN node (and as such, it is identified with a hashkey 
prefix), but spans several physical nodes to increase re- 
siliency to node failures. Each supernode consists of a 
fixed number (called supernode cardinality) of supern- 
ode components. Components of the same supernode are 
called peers and are usually placed on separate physical 
nodes, as shown in Fig. 2. Practical supernode cardinality
values are in the 4-32 range, and in the commercial
HYDRAstor it is set to 12. For a given HYDRAstor in- 
carnation, its supernode cardinality is the same for all 
supernodes and is constant throughout the entire system
lifetime.

Supernode peers use a distributed consensus algorithm
to decide what changes should be applied to the supernode;
for example, after a node failure, they decide on
which physical nodes new incarnations of lost components
should be recovered.


3.2 Read and Write Handling 


On write, a block of data is routed to one of the peers of 
the supernode responsible for the hashkey space where 
the block’s hash belongs. For both read and write re- 
quests, the peer is deterministically chosen based on 
the hashkey of the data. Next, this write-handling peer 
checks if a suitable duplicate is already stored; this pro- 
cess is described in detail in Section 5.2. If a duplicate is 
found, its address is returned; otherwise the new block is 
compressed (if requested by a user), fragmented, and its 
fragments are distributed to the remaining peers.
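
A minimal sketch of the write path described above, under stated assumptions. The source says the handling peer is chosen deterministically from the hashkey but does not give the rule, so the modulo used in `choose_peer` is an assumption; the `Supernode` class is hypothetical, and compression, erasure coding, and fragment distribution are replaced by a single dictionary.

```python
import hashlib

CARDINALITY = 4  # hypothetical supernode cardinality


def choose_peer(hashkey: bytes, cardinality: int = CARDINALITY) -> int:
    """Deterministic peer choice; the modulo rule is an assumption."""
    return int.from_bytes(hashkey, "big") % cardinality


class Supernode:
    """Hypothetical stand-in for the peers of one supernode."""

    def __init__(self, cardinality: int = CARDINALITY):
        self.cardinality = cardinality
        self.index = {}          # hashkey -> block (stands in for fragments)
        self.stored_writes = 0   # writes that actually consumed space

    def write(self, data: bytes):
        hashkey = hashlib.sha1(data).digest()
        peer = choose_peer(hashkey, self.cardinality)
        if hashkey not in self.index:
            # No duplicate found: store the block. A real system would
            # compress it, erasure-code it, and spread the fragments.
            self.index[hashkey] = data
            self.stored_writes += 1
        return hashkey, peer


supernode = Supernode()
first = supernode.write(b"backup chunk")
second = supernode.write(b"backup chunk")
```

Writing the same content twice returns the same address and handling peer, and consumes storage only once.
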

On read, the read handling peer first determines the 
minimal number of fragments required to reconstruct this 
block, which is stored in the block’s metadata. Next, the 
read handling peer sends fragment read requests to some 
of the other peers. If any of these requests times out, all 
remaining fragments are read. After a sufficient number 
of fragments have been found, the block is reconstructed, 
decompressed (if it was compressed), its SHA-1 hash is 
verified and, in case of successful verification, returned 
to the user. 
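
The read path can be sketched as follows. The `fetch` callable, the choice of the first m peers for the initial requests, and the use of `None` to model a timeout are all assumptions; the source says only that requests go to "some of the other peers" and that all remaining fragments are read after a timeout. Erasure decoding is omitted.

```python
import hashlib


def collect_fragments(fetch, n: int, m: int) -> dict:
    """Gather at least m of n fragments.

    fetch(i) returns fragment i, or None to model a timed-out request.
    """
    got = {i: fetch(i) for i in range(m)}      # minimal initial request set
    if any(fragment is None for fragment in got.values()):
        for i in range(m, n):                  # a timeout: read all the rest
            got[i] = fetch(i)
    found = {i: f for i, f in got.items() if f is not None}
    if len(found) < m:
        raise IOError("too few fragments to reconstruct the block")
    return found


def verify(block: bytes, expected_sha1: bytes) -> bool:
    """Check a reconstructed block against its SHA-1 hash before returning it."""
    return hashlib.sha1(block).digest() == expected_sha1


# Hypothetical 2-of-4 block whose second fragment times out.
stored = {0: b"f0", 1: None, 2: b"f2", 3: b"f3"}
fragments = collect_fragments(lambda i: stored[i], n=4, m=2)
```

In the common case only m requests are issued; the extra reads happen only after a timeout, which keeps the normal read path cheap.
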

In general, sequential reading is very efficient since the 
blocks are read in the same order that they were written 
and the individual fragments end up getting prefetched 
from disk into local memory. Usually, the requested frag- 
ment is present in the current component location. How- 
ever in some cases (for example after intermittent fail- 
ures), the requested fragment may only be present in one 
of the previous locations of this component. In such a 
case, the component directs a distributed search for the 
missing data. In particular, the trail of previous compo- 
nent locations can be searched in the reverse order. 


3.3 Load Balancing


In a distributed storage system like the HYDRAstor 
back-end, the distribution of data among physical nodes 
is critical for system survivability, data resiliency and 
availability, storage utilization, and system performance. 
For example, placing too many peer components on 
one machine may have catastrophic consequences if this 
node is lost. The affected supernode may not recover, 
because too many components have been lost; and even 
when it is recoverable, some or even all of the data handled
by this supernode may not be readable, due to loss
of too many fragments. Also, performance of the system
is maximized when components are assigned proportion- 
ally to available node resources, since the load on each 
node is proportional to the prefix space covered by the 
components assigned to this node. 

Our system continuously attempts to balance compo- 
nent distribution over all physical machines to reach a 
state where failure resiliency, performance and storage 
utilization are maximized. The quality of a given distri- 
bution is measured by a multi-dimensional function pri- 
oritizing these objectives, called system entropy. Balanc- 
ing is carried out by each machine, which periodically 
considers all possible transfers of locally hosted compo- 
nents to neighboring nodes. If the machine finds a trans- 
fer that would improve the distribution, it is executed. 
After a component arrives at a new location, its data is 
also moved from old location(s) to the new one; but this 
data transfer happens in the background and may take a 
long time. 
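
The balancing loop can be illustrated with a greedy sketch. The source does not define the system entropy function, so the `cost` function below is purely hypothetical: it heavily penalizes peers of the same supernode sharing a machine and mildly penalizes uneven load, and it is minimized rather than maximized. The sketch also considers every other node as a transfer target, whereas the paper speaks of neighboring nodes.

```python
from collections import Counter


def cost(placement: dict) -> int:
    """Hypothetical stand-in for system entropy (lower is better).

    placement maps (supernode_id, peer_index) -> physical node name.
    """
    load = Counter(placement.values())
    together = Counter((sn, node) for (sn, _), node in placement.items())
    clashes = sum(count - 1 for count in together.values())
    return 1000 * clashes + sum(count * count for count in load.values())


def balance_step(placement: dict, machine: str, nodes: list):
    """Try all transfers of components hosted on `machine`; apply the best.

    Returns the executed (component, target) transfer, or None.
    """
    best, best_cost = None, cost(placement)
    for component, node in placement.items():
        if node != machine:
            continue
        for target in nodes:
            if target == machine:
                continue
            trial = dict(placement)
            trial[component] = target
            trial_cost = cost(trial)
            if trial_cost < best_cost:
                best, best_cost = (component, target), trial_cost
    if best is not None:
        placement[best[0]] = best[1]
    return best


nodes = ["n1", "n2", "n3"]
placement = {("s0", 0): "n1", ("s0", 1): "n1", ("s1", 0): "n2"}
moved = balance_step(placement, "n1", nodes)
```

Here two peers of supernode `s0` start on the same machine; one step moves one of them to the empty node, after which no further transfer from `n1` improves the distribution.
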

The same entropy-driven balancing is applied to the
system when nodes are added to or removed from the system.


3.4 Impact of Supernode Cardinality 


Selection of supernode cardinality has a profound impact
on the properties of HYDRAstor. First of all, it determines
the maximal number of tolerated node failures. The net- 
work overlay, but not necessarily user data, survives node 
failures as long as each supernode remains alive. A su- 
pernode survives if at least half of the supernode’s peers 
plus one remain alive so they can reach a consensus. 
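
As a quick arithmetic illustration of the survival condition, the sketch below reads "at least half of the peers plus one" as a strict majority, floor(n/2) + 1; that reading, and the function names, are assumptions for odd cardinalities, where the text does not specify the rounding.

```python
def min_alive_peers(cardinality: int) -> int:
    """Smallest number of live peers that can still reach consensus."""
    return cardinality // 2 + 1


def max_tolerated_peer_failures(cardinality: int) -> int:
    """Peers that may fail while the supernode itself survives."""
    return cardinality - min_alive_peers(cardinality)


quorum = min_alive_peers(12)                 # commercial cardinality of 12
tolerated = max_tolerated_peer_failures(12)
```

With the commercial cardinality of 12, this gives a quorum of 7 peers, so the overlay survives the loss of up to 5 peers of any one supernode. User data may be lost earlier, depending on the resiliency class of each block.
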

Supernode cardinality also influences scalability, at 
least in theory. For a given cardinality, the probability 
that each supernode survives is fixed; the higher the cardinality,
the higher the probability of survival. When the
system size grows, its number of supernodes also grows,
and, as a result, the system reliability decreases, because
the system is operational only if all supernodes are
alive. However, the practical impact of this limitation
is negligible in the target range of system size, because 
permanent loss of a physical node is very rare, and self- 
healing reduces the window of vulnerability even when 
it happens. 
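
The scaling argument can be made concrete with a short sketch, assuming that supernodes fail independently (an assumption not stated in the source). If each supernode survives with probability p and all must survive, the system survives with probability p raised to the number of supernodes. The value of p below is hypothetical.

```python
def system_survival(p_supernode: float, supernodes: int) -> float:
    """Probability that all supernodes survive, assuming independence."""
    return p_supernode ** supernodes


p = 0.9999  # hypothetical per-supernode survival probability
small = system_survival(p, 10)
large = system_survival(p, 1000)
```

The larger system is less likely to have every supernode alive, which is the theoretical limit on scalability that the text describes.
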

Finally, supernode cardinality defines the number of 
data redundancy classes available. Erasure coding is 
parametrized with the maximal number of fragments that 
can be lost while a block remains still reconstructible 
(standard m-of-n erasure codes with n set to supernode 
cardinality and m determined by the redundancy class; 
we use the Cauchy-based Reed-Solomon codes [9]). 
Since in HYDRAstor the erasure coding always produces 
supernode cardinality fragments, the tolerated number of 
lost fragments can vary from one to supernode cardinality
minus one (in the latter case we keep supernode
cardinality copies of such a block). Each such choice of
tolerated number of lost fragments defines one data
redundancy class. Each class represents a different tradeoff
between storage overhead (due to erasure coding) and 
failure resiliency. 
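
To make the tradeoff concrete, the sketch below enumerates the redundancy classes for a given supernode cardinality. It is an illustration only: the function name is hypothetical, and the overhead figure is the raw n/m ratio of an m-of-n code, ignoring compression and metadata.

```python
from fractions import Fraction


def redundancy_classes(cardinality: int):
    """List (tolerated_losses, fragments_needed, raw_overhead) per class."""
    classes = []
    for tolerated in range(1, cardinality):
        needed = cardinality - tolerated          # m in m-of-n coding
        overhead = Fraction(cardinality, needed)  # stored bytes per user byte
        classes.append((tolerated, needed, overhead))
    return classes
```

For a cardinality of 12, the class tolerating 3 lost fragments needs 9 fragments to reconstruct a block and stores 4/3 bytes per user byte, while the class tolerating 11 lost fragments amounts to keeping 12 copies.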


4 Data Organization 


Proper representation of stored data is critical for meet- 
ing reliability, availability and performance targets of 
HYDRAstor. The system should be able to easily iden- 
tify the availability of stored data, and in case of a fail- 
ure, rebuild only the data actually written and only to the 
requested resiliency level (as opposed to RAID, which 
rebuilds an entire disk even if it contains no valid user data).
Since components move between nodes followed by the 
data transfer, it should be possible to locate and retrieve 
data from old component locations. When such data is 
available, it should be transferred instead of being re- 
built, as transfer is a much cheaper operation. Data writ- 
ten in one stream should be placed nearby to maximize 
write and read performance. Last but not least, the data 
organization should support on-demand distributed data 
deletion, in which data blocks not reachable from any 
live retention root are deleted and the space occupied by 
them is reclaimed. 


4.1 Synchruns and Synchrun Components 


As discussed earlier, we use erasure coding for data re- 
dundancy. Resulting fragments of one block are dis- 
tributed to peer components of the supernode responsi- 
ble for this block. The basic logical unit of data man- 
agement in HYDRAstor is the synchrun, containing a 
limited number of blocks written consecutively by one 
write-handling peer component. 


A synchrun is analogous to a stripe in a RAID group
since both allow reads and writes of continuous
data faster than any single disk can do. Unlike a RAID
stripe, a synchrun is also the basic block that is used for 
data balancing and load management as described below. 
Since writing a block really means writing a supernode 
cardinality of its fragments, each synchrun is represented 
by supernode cardinality of synchrun components, one 
for each peer. For the i-th peer of a supernode, the
corresponding synchrun component contains all i-th
fragments of the synchrun blocks. A synchrun is a logical
structure only, but synchrun components actually exist 
on corresponding peers. 
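
The relationship between a synchrun and its components can be sketched as a simple transposition. This is an illustrative model only: fragments are represented as opaque strings, and the function name is hypothetical.

```python
def synchrun_components(blocks_fragments):
    """Group fragments of a synchrun by peer index.

    blocks_fragments: one list per block, in write order, each holding
    one fragment per peer (length equals the supernode cardinality).
    Returns one list per peer; list i holds fragment i of every block.
    """
    cardinality = len(blocks_fragments[0])
    if any(len(frags) != cardinality for frags in blocks_fragments):
        raise ValueError("every block must have one fragment per peer")
    return [
        [frags[peer] for frags in blocks_fragments]
        for peer in range(cardinality)
    ]
```

For example, a synchrun of two blocks with three fragments each yields three components, each holding two fragments in the original write order.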


4.2 Chains of Containers


At any given time, each write-handling peer writes block 
fragments to exactly one synchrun. As a result, all such 
synchruns can be logically ordered in a chain, with the 
order determined by the write-handling peer. Synchrun 
components are placed in a data structure called syn- 
chrun component container (SCC). Each SCC can con- 
tain one or more chain-adjacent synchrun components, 
and as a result, SCCs also form chains similar to synchrun
component chains.

Figure 3: Data organization with synchruns and synchrun
containers.


The upper row in Fig. 3 shows synchruns A and B 
that belong to the empty prefix supernode which covers 
the entire hashkey space. Each synchrun component 
is placed here in one SCC, with its individual fragments 
represented by smaller boxes inside the SCC. SCCs with 
synchrun components of these synchruns are shown as 
rectangles placed one behind the other. A chain of synchruns
is represented by supernode cardinality SCC chains,
which we call peer SCC chains. In the remainder
of Fig. 3 we show only one such peer SCC
chain.

Peer SCC chains are normally identical with regard to
the synchrun components’ metadata and the number of 
fragments they hold, but there are occasional differences 
caused by node failures which cause holes in the chains. 
This chain organization allows for relatively simple and 
efficient implementation of required features. For ex- 
ample, if the number of peer chains without any holes is 
not lower than the number of fragments needed to recon- 
struct each block, then we infer that the data is available 
(i.e. all blocks are reconstructible). In this way, data
availability can easily be determined for each
redundancy class. 
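
The availability test described above can be sketched as follows. This is a simplified model, not the system's implementation: each peer chain is a list of SCC identifiers with None standing for a hole, and the function names are hypothetical. The check is a sufficient condition only; the more refined algorithms mentioned later may still establish availability when it fails.

```python
def complete_chains(peer_chains):
    """Count peer SCC chains that contain no holes (None entries)."""
    return sum(1 for chain in peer_chains if None not in chain)


def inferred_available(peer_chains, fragments_needed):
    """True if enough hole-free chains exist to reconstruct every block."""
    return complete_chains(peer_chains) >= fragments_needed
```

With 12 peer chains of which 3 have holes, data in a class that needs 9 fragments is inferred to be available, whereas a class that needs 10 fragments cannot be confirmed by this test alone.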

Each supernode will eventually be split as more data is 
stored or as more nodes are added to the system. This is
a regular FPN split which results in two new supernodes
with prefixes extended from their ancestor prefix with, 
respectively, 0 and 1. After a supernode split, each synchrun
in this supernode is also split, with fragments distributed
between the two resulting synchruns based on their hash prefixes. The
second row of Fig. 3 shows two such chains, one for the 
supernode with the prefix 0, and the other with the pre- 
fix 1. As a result of the split, fragments from synchruns 
A and B are distributed to these two chains. The system 
now has 4 synchruns, each approximately half the size of 
the originals.
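
A synchrun split can be sketched as a partition on the next hash bit after the parent prefix. The representation below is an assumption made for illustration: block hashes are strings of '0' and '1' characters, and the function name is hypothetical.

```python
def split_synchrun(blocks, prefix_len):
    """Split one synchrun into the two synchruns of the child supernodes.

    blocks: list of (hash_bits, payload) in write order, where hash_bits
    is a string of '0'/'1' characters.
    prefix_len: length of the parent supernode's prefix.
    Returns (zero_side, one_side), each preserving the write order.
    """
    zero_side, one_side = [], []
    for hash_bits, payload in blocks:
        target = one_side if hash_bits[prefix_len] == "1" else zero_side
        target.append((hash_bits, payload))
    return zero_side, one_side
```

Splitting the synchruns of the empty-prefix supernode uses prefix_len 0, so each block goes to the child whose one-bit prefix matches the first bit of the block's hash.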

The system strives to maintain a limited number of 
local SCCs, and merges adjacent synchrun components 
into one SCC (as shown in the third row of Fig. 3) until
the maximum size of an SCC is reached. By limiting the 
number of local SCCs, the system can keep their meta- 
data cached in RAM which enables fast determination of 
actions needed for providing data services. The target 
size of an SCC is a configuration constant (usually set 
well below 100 MB), so multiple SCCs can be read into 
the main memory. These SCC concatenations are loosely 
synchronized on all peers, so peer chains look the same. 
A similar operation is needed after deletion, shown in 
the remaining rows of this figure and discussed later in 
Section 5.3.

This data organization is relatively simple in a static 
system, but it becomes quite complex due to the dynamic 
nature of the HYDRAstor back-end. For example, when 
a peer is transferred to another physical node because 
of load balancing, its chains are transferred in the back- 
ground to a new location, one SCC at a time. Similarly, 
after a supernode split, not all SCCs of the supernode are 
split immediately; instead we run background operations 
adjusting chains to the current supernode locations and 
shape. As a result, at any given moment, we may have
chains partially-split, partially present in previous loca- 
tions of this peer, or both. After failure, we may have se- 
rious holes in some of the chains. Fortunately, since peer 
chains describe the same data, we have supernode cardi- 
nality chain redundancy in the system, so usually there 
is a sufficient number of complete chains. This chain 
redundancy allows for reasoning about the data in the 
system even in the presence of transfers/failures. Addi- 
tionally, more refined algorithms are used in some cases, 
constructing chain coverage from chain parts present on 
different peers. 


5 Data Services 


Based on the data organization described above, 
HYDRAstor efficiently builds data services like identification
of the recoverability of data, deletion and space
reclamation, locating data in the network, data deduplication
and others. A detailed description of all of
these features is beyond the scope of this paper, but this
section presents a sketch of the data rebuilding, deletion
and duplicate elimination services.


5.1 Data Rebuilding 


When a node or disk fails, the SCCs residing on that
node or disk are lost. As a result, the redundancy of the 
data blocks with fragments belonging to these SCCs is 
at best reduced below the level requested by users when 
writing these blocks. In the worst case, a block may be 
lost completely if not enough fragments survive. To keep 
the block redundancy at the desired levels, the system 
scans SCC chains looking for holes and schedules data 
rebuilding as background jobs for each missing SCC. 

Multiple peer SCCs can be rebuilt in one rebuilding 
operation. Based on SCC metadata, the minimal number 
of peer SCCs needed for rebuilding is read by the peer 
that is in charge of rebuilding. This peer does bulk era- 
sure decoding and encoding to restore the missing frag- 
ments. Next, the rebuilt SCCs are sent to the current tar- 
get locations. Before SCCs are rebuilt, all input SCCs 
are made to look the same, i.e. required splits and con- 
catenations are performed first. This requirement allows 
for fast bulk rebuilding as measured in Section 6. 


5.2 Duplicate Elimination 


Duplicate elimination can be classified in many dimen- 
sions: (1) the granularity of the deduplication: whole 
files, partial files, fixed size blocks or variable sized 
blocks; (2) time when the deduplication occurs: inline 
during the write phase or as a background process; (3) 
precision of duplicate identification: can the system re- 
liably find all duplicates or does it use an approximate 
technique which trades precision for increased perfor- 
mance? (4) the verification of equality between a dupli- 
cate and its copy: just by comparison of hashes or with 
full data comparison; (5) the scope of the deduplication: 
the whole system (global deduplication), or the dedupli- 
cation limited for example to data on a specific node (lo- 
cal deduplication). 

Today HYDRAstor implements variable-sized block,
inline, hash-verified global duplicate elimination on the
storage nodes. Variable-size blocks allow for
better deduplication, because content-dependent chunking
can be used [40]. Inline deduplication increases
write throughput, since duplicated block writes can be 
handled without writing to disk; this also increases stor- 
age efficiency compared to off-line deduplication. For 
regular blocks, we use fast approximate deduplication, 
whereas for retention roots, we do reliable duplicate 
elimination to ensure that searchable retention roots with 
the same search key but different contents are not written. 




In both cases, for successful deduplication, we require 
that the potential duplicate of the block being written has 
a redundancy class not weaker than the class requested 
by this write and that the potential old duplicate is recon- 
structible. 

On a regular block write, the peer handling this write 
is selected based on the hash of this block. It means that 
two identical blocks written when this peer is alive will 
be handled by it, and the second block will be found to be a
duplicate of the first one. 

A more complicated case arises when the write- 
handling peer has been recently created because of a transfer
or component recovery, and it does not yet have all
the data it should have, i.e. its local SCC chain is not
complete. In this case, we go to the longest-alive peer
in the current supernode to check for possible duplicates.
This is just a heuristic, as this peer may also not have
a complete SCC chain, so a duplicate may not
be detected. However, such a miss occurs only in corner 
cases, after massive failures when most likely all chains 
are broken. Moreover, for a particular block, we miss 
only one opportunity to eliminate a duplicate; the next 
duplicate block will be deduplicated unless another fail- 
ure or transfer of this peer happens. 

For retention roots, we need to ensure that two blocks 
with the same search key have identical contents (oth- 
erwise retention roots would not uniquely identify snap- 
shots). As a result, we need accurate duplicate elimi- 
nation for retention roots. When a local SCC chain has 
holes at the peer handling this write, the peer sends dupli- 
cate elimination queries to all other peers in this supern- 
ode. Each of these peers checks locally for a duplicate. 
A negative answer also includes a summary description 
of the parts of the SCC chain on which this answer is 
based. The write handling peer collects all replies. If 
there is at least one positive, a duplicate is found; other- 
wise, when all are negative, this peer tries to determine if 
SCC information attached to negative replies covers one 
entire SCC chain. If yes, the new block is not a duplicate;
otherwise such a determination cannot be made and
the write is rejected with a special error status indicating
that data rebuilding is in progress (this may happen after
massive failures); in such a case the write should be
resubmitted later. So far, such situations have
happened only in special tests, and never in practice.
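
The decision made by the write-handling peer can be sketched as below. The data model is an assumption for illustration: each reply is a pair (found, covered), where covered is the set of chain positions the answer is based on, and one full SCC chain is modeled as a set of positions. The names Verdict and resolve_retention_root are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    DUPLICATE = "duplicate"
    NEW_BLOCK = "new block"
    RETRY_LATER = "retry later"


def resolve_retention_root(replies, chain_positions):
    """Combine per-peer replies to a duplicate elimination query.

    replies: list of (found, covered) tuples, one per queried peer.
    chain_positions: set of all positions in one entire SCC chain.
    """
    # Any positive answer settles the question.
    if any(found for found, _ in replies):
        return Verdict.DUPLICATE
    # All answers are negative: check whether together they cover a chain.
    covered = set()
    for _, part in replies:
        covered |= part
    if chain_positions <= covered:
        return Verdict.NEW_BLOCK
    # Coverage is incomplete, so the write is rejected for later retry.
    return Verdict.RETRY_LATER
```

The third outcome corresponds to the special error status: the absence of a duplicate cannot be proven until rebuilding restores enough of the chain.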


5.3 Deletion and Space Reclamation


Implementing data deletion in a system like HYDRAstor 
turned out to be surprisingly difficult because of many 
challenges which stem from the nature of the sys- 
tem: content-addressability, distribution, failure toler- 
ance, and duplicate elimination. While deletion in our 
content-addressable system is somewhat similar to distributed
garbage collection [29], which is well understood,
overcoming the remaining challenges, discussed
below, required new research. 

When deciding if a block is to be duplicate-eliminated 
against its older copy, we must be sure that this old block 
is not scheduled for deletion. Deciding which block to 
keep and which to delete must be globally consistent and 
robust in the presence of failures. For example, a deletion
decision, once made, should not be temporarily lost due
to intermittent failures, as otherwise we may eliminate 
duplicates using blocks which are really scheduled for 
deletion. Moreover, the robustness of the data deletion 
algorithm should be higher than the data robustness. As
a result, even if some blocks are lost, data deletion should
be able to proceed to logically remove the lost data and 
heal the system if requested to do so by the user. 

To simplify the design and make the implementa- 
tion manageable, we have implemented deletion in two 
phases. During the first phase, the system is read-only 
and blocks are marked for deletion. In the second phase, 
the data can be read and written, as the system reclaims 
the blocks marked for deletion. Having a read-only phase 
simplified the deletion implementation, because this approach
lets us eliminate the impact of writes on marking
blocks for removal. 

Deletion is implemented with a per-block reference 
counter that counts the number of pointers in blocks in 
the system pointing to this block. Reference counters are 
not updated immediately on write. Instead, they are updated
later, in the read-only phase, which processes all pointers
written since the previous read-only phase (so the
counter update is incremental). For each such pointer, 
the reference counter of the pointed-to block is incremented.
After all such increments are completed, all
blocks with a reference counter equal to zero are marked
for deletion (dark-shaded fragments in Fig. 3). Moreover,
reference counters of blocks pointed to by blocks
already marked for deletion (including roots with associated
deletion roots) are decremented. Next, the whole
decrementation process (i.e. marking for removal blocks
with reference counters equal to zero and decrementing
reference counters of blocks pointed to by pointers included
in these blocks) is repeated, until no more new blocks can
be marked for deletion. At this point, the read-only phase 
ends, and blocks marked for deletion can be removed in 
the background. 
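
The counter update and marking loop can be sketched on a single machine as follows. This is a simplified, non-distributed illustration, not the system's implementation: blocks are identified by strings, a retention root with an associated deletion root is modeled simply by leaving it out of live_roots, and marked blocks are dropped from the counter table to stand in for background reclamation. All names are hypothetical.

```python
def read_only_phase(counters, pointers, new_blocks, live_roots):
    """Run one read-only phase and return the set of blocks marked for deletion.

    counters: dict block -> reference count carried over from the previous
        phase (updated in place).
    pointers: dict block -> list of blocks it points to.
    new_blocks: blocks written since the previous read-only phase.
    live_roots: retention roots without an associated deletion root.
    """
    # Incremental update: only pointers written since the last phase.
    for block in new_blocks:
        counters.setdefault(block, 0)
        for target in pointers.get(block, ()):
            counters[target] = counters.get(target, 0) + 1

    # Mark zero-count blocks, decrement what they point to, and repeat.
    marked = set()
    frontier = [b for b, c in counters.items() if c == 0 and b not in live_roots]
    while frontier:
        block = frontier.pop()
        if block in marked:
            continue
        marked.add(block)
        for target in pointers.get(block, ()):
            counters[target] -= 1
            if counters[target] == 0 and target not in live_roots:
                frontier.append(target)

    # Stand-in for background reclamation of the marked blocks.
    for block in marked:
        del counters[block]
    return marked
```

Because counters persist between phases, a later phase only touches pointers of newly written blocks and of blocks that become unreachable, which is what makes the update incremental.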

The deletion algorithm described above requires that 
the metadata of all blocks, as well as all the pointers, be 
present before proceeding. The pointers and block meta- 
data are replicated on all peers, so the deletion can pro- 
ceed even if some blocks are no longer reconstructible, 
as long as at least one block fragment exists. 

Since blocks are really kept as fragments, a copy of the 
block reference counter is kept per-fragment, and each
fragment of a given block should have the same value
of this counter. Reference counters are computed inde- 
pendently on peers participating in the read-only phase. 
Before deletion is started, each such peer must have its 
SCC chain complete with respect to fragment metadata 
and pointers. Not all peers in a supernode have to par- 
ticipate, but some minimal number of peers is required 
to complete the read-only phase. The computed coun- 
ters are later propagated in the background to remaining 
peers. 

The redundancy in counter computation allows a dele- 
tion decision to survive node failures. However, the in- 
termediate results of deletion computations are not per- 
sistent. Any failure before the decision is made wipes 
out these results on the affected nodes, and the whole 
computation needs to be repeated if too many peers can- 
not participate in this phase any more. Deletion can 
still continue, if a sufficient number of peers in each su- 
pernode are not affected by the failure. Upon conclu- 
sion of the read-only phase, the new counter values are 
made failure-tolerant. All dead blocks, i.e. blocks with
counters equal to zero, are then swept out from physical
storage in the background (reclamation in Fig. 3). Free 
space fragmentation is avoided by rewriting the whole 
synchrun component container, copying only fragments 
of live blocks to the new location. 


6 Evaluation 


Each current HYDRAstor storage node (SN) runs one
back-end server, and has six 500 GB SATA disks, 6 GB
RAM, two dual-core 3 GHz CPUs and two GigE cards.
Some experiments have also been done with the experimental
next generation hardware (denoted SN2), in
which each storage node runs two back-end servers and
has twelve 1 TB SATA disks, 20 GB of RAM, two quad-core
3 GHz CPUs and four GigE cards. In all experiments,
we used the current access node (AN) with 6 GB
RAM, two dual-core 3 GHz CPUs, two GigE cards and
only limited local storage. All nodes run the Red Hat EL 
5.1 version of Linux. 

All experiments were performed using a block size of
64 KB, compressible by 33% to 48 KB, except where
noted. The system was configured with a supernode car- 
dinality of 12 and the number of supernodes was equal to 
the number of physical machines. All of the tests wrote 
data using a resiliency class which has 9 original and 3 
redundant fragments. 


6.1 Read/Write Bandwidth 


This experiment shows write throughput as a function 
of the fraction of blocks detected as duplicates for two 


different compression ratios. We have used 4 SN2 ma- 
chines, and 4 AN machines, each with one testing driver 
able to generate a stream of blocks with a specified per- 
centage of duplicates and compression ratio. Duplicates 
are evenly distributed in the stream. Duplicated data is 
written in the same order as the base data, re-creating the 
original data stream. For the read experiment, the testing 
driver attempts to read data in the same order as it was 
written.

Figure 4: Write throughput as a function of duplicate ratio.


As shown in Fig. 4, very high bandwidth is achieved, 
which is a consequence of a carefully chosen data organi- 
zation utilizing bulk transfer to disk. Duplicates are pro- 
cessed much more effectively than non-duplicated data, 
because they do not require fragmentation, fragment dis- 
tribution and storage. Moreover, SCC-based organiza- 
tion allows the write-handling peer to perform fast local 
duplicate elimination by checking block reconstructibil- 
ity with SCC reports submitted in the background from 
the remaining peers. However, when all writes are du- 
plicates, the network bandwidth between AN and SNs 
becomes a bottleneck, and the overall performance does
not increase as much as expected (both curves flatten a 
bit at 100% duplicates). For high deduplication ratios, 
the CPU utilization decreases dramatically and network 
bandwidth between storage nodes remains available, so 
background tasks like data reconstruction and data scrub- 
bing can be run without impact on user-visible perfor- 
mance. 

Read bandwidth highly depends on factors like the se- 
quentiality of the data read, the number of drivers reading 
simultaneously and the granularity of the distribution of 
the duplicates in the data. A detailed discussion of the 
impact of these factors on read performance is beyond 
the scope of this paper. Instead, we give read through- 
put achieved when reading the data written during the 
experiment described above. With four drivers reading, 
the total combined read bandwidth for indicated levels 
of deduplication was between 450 MB/s and 790 MB/s
for 33% compressible data, and between 400 MB/s and
550 MB/s for 0% compressible data.

The time required to fill the 4 SN2 node system de- 
pends on the percentage of duplicates in the data written. 
The system can be filled in 1 day when writing with no 
duplicates, while filling a system with 95% duplicated 
data can take up to 10 days. In general, for configura- 
tions in which high performance is not a priority, fewer 
ANs can be used, which also extends the time-to-fill.

These results were obtained with testing drivers run- 
ning on the ANs. Experiments with real backup ap- 
plications using the filesystem front-end yielded similar 
performance. However, since the experiments were not 
done in a controlled setup, their results are not presented 
here. 


6.2 System Scaling 


This experiment, with up to 12 SNs and the number of 
ANs set to half of the number of SNs, shows how performance
scales as the numbers of storage nodes and access
nodes increase. Two sets of measurements were taken:
a dynamic one, in which nodes are added while the
user is writing, and a static one in which the number of 
nodes remains constant during the test. In the latter case, 
each measurement was taken after re-initializing the sys- 
tem from scratch and then loading the same amount of 
random, non-duplicated data. Time on the X axis refers 
to the dynamic case only.

Figure 5: Dynamic vs. static scalability test.


The results indicate that in the range of nodes tested 
the system performance scales linearly with the system 
growth in the static case. The system attempts to balance 
components so that the hash space is divided equally 
across storage nodes. Such balancing guarantees that ev- 
ery machine is equally loaded and does not become a 
bottleneck. In the dynamic case, the cost of dynamically 
reconfiguring the system results in lower user bandwidth. 
This happens since most of the data is on the older nodes 
which are checked on every write for duplicate elimination.
However, after all data transfers are completed, the
performance in the dynamic case will be the same as in
the static case.


6.3 Node Failure and Data Rebuilding 


This experiment shows the system behavior and its per- 
formance just after node failure, during resulting data re- 
construction, and after the failed node is recovered. The 
system tested has 4 SN2 machines and 4 AN machines.

Figure 6: Node failure during backup.


We started writing to the healthy system with four stor- 
age nodes, achieving write throughput over 600 MB/s. 
After about 14 minutes one storage node failed (both 
back-end servers crashed). Write performance just after 
the node failure dropped to 300 MB/s, then stabilized at 
about 400 MB/s. The initial drop was caused by timed- 
out messages to the failed node and overhead for system
rebalancing. Data rebuilding (reconstruction) tasks
were scheduled; however, they were suppressed because of
the ongoing user backup. Reconstruction started to work
at full bandwidth just after all user writes had
finished. Every block reconstruction required reading
all 9 remaining fragments in order to rebuild the 3 lost 
ones. The reconstruction read bandwidth reached 480 
MB/s on the 3 surviving machines with a reconstruction 
write bandwidth of 160 MB/s. The rebuilding finished 
in the 58th minute of the experiment leaving a healthy 
system with only 3 storage nodes. 
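
The relation between the two reconstruction bandwidths follows from the 9+3 resiliency class: every 9 fragments read produce 3 rebuilt fragments. A small sketch of this arithmetic, with a hypothetical function name and assuming all fragments have equal size:

```python
def reconstruction_write_bandwidth(read_mb_s, fragments_read, fragments_rebuilt):
    """Write bandwidth implied by the read bandwidth and the coding ratio."""
    return read_mb_s * fragments_rebuilt / fragments_read
```

With the measured 480 MB/s of reads, 9 fragments read and 3 rebuilt per block, the function gives 160 MB/s, consistent with the reported write bandwidth.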

In the 64th minute the next writing session started 
achieving write bandwidth of 430 MB/s. The failed node 
was recovered and connected once again in the 100th 
minute. Just after the re-connection, system write band- 
width dropped to 380 MB/s, but when components rebal- 
ancing was finished it increased to about 550 MB/s. At 
the end of the experiment the system had 4 storage nodes;
however, it was not healthy, as not all data (SCCs) were
in the correct places. Write performance will increase
to the initial level (600 MB/s) after all pending transfers are
finished and the system becomes healthy again.

The results show that the system maximizes user bandwidth
during backup even if background tasks are pending.
In particular, ongoing reconstruction is suspended if
a new backup is started. This approach allows a user to 
minimize costly backup windows regardless of internal 
system state, but carries the risk of starvation of critical 
data rebuilding tasks. However, this may happen only if 
the system is fully loaded by a user all the time and only 
when the user writes non-duplicated data. If the user load 
decreases or some duplicates are written, reconstruction 
is executed in the background. Finally, this experiment 
also shows how quickly the system adjusts to changes in 
its environment, as it takes only a few minutes for the 
system to fully utilize released resources.

Figure 7: Read-only phase duration.


6.4 Data Deletion 


The purpose of this experiment is to evaluate the du- 
ration of the read-only phase as a function of the data 
loaded with a focus on scenarios reflecting a realistic 
system usage. Therefore, we use the file system interface
to write and delete data periodically, increasing the overall
amount of data stored in the system. Deletion experiments
were performed using 4 SN2 machines and 1 AN
machine.

The test runs the following four-step sequence four
times. During the initial step (shown with the dotted
lines in Fig. 7), the data is loaded into the system.
The loading phase pauses at 1 TB, 2 TB, 4 TB and 8 TB.
In the second step, during each pause, the read-only data
deletion phase is run, which recomputes the counters for
the newly loaded data (note that no data is marked for
deletion at this point). The duration of this step is shown
with light-gray bars. In the third step (shown with dashed
lines), an additional half terabyte of new data (ND) is
loaded and a user invokes a deletion operation of 0.2 TB
of older data (DD). In the last step, one more read-only 


phase is run to recompute the counters to reflect the re- 
cently loaded data and mark the blocks for deletion. 

The duration of each read-only phase is shown with 
dark-gray bars. In all cases, the new data is not com- 
pressible and does not contain any duplicates, but with 
duplicates present the results will be similar, except that 
all phases will be shorter. 

Although the X axis in Fig. 7 shows the duration of each
read-only phase, the data-loading steps are not shown in
proportion there, because they are too long (we load terabytes
of data and it takes several hours). We note that
all read-only phases are relatively short: the longest one,
after loading 3.7 TB of data (which took about 4 hours),
is about 30 minutes, resulting in a deletion time of under
13% of the writing time. For writing with two ANs, this fraction
can go up to 20% in the case of non-duplicated streams.
When writing data with a high number of duplicates (the 
common case with backups), deletion takes significantly 
less time (on the order of 5% of the writing time), since 
less data needs to be read to access the pointers, and fill- 
ing the capacity takes so much longer. Moreover, the 
duration of the first read-only phase (shown with the 
light-gray bars) in each sequence is proportional to the 
new data loaded in the first step of the scenario. Finally, 
the duration of the second read-only phase (shown with 
dark bars) is fairly constant, taking around 11 minutes 
per run. This also shows the power of the incremental
reference-counting deletion in HYDRAstor. The dura- 
tion of the read-only phase depends only on the amount 
of data added and deleted since the previous run of this 
phase, but not on the total amount of data in the system. 
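
The "under 13%" figure quoted above can be checked from the two approximate durations given in the text (about 30 minutes of read-only phase after about 4 hours of loading). The helper name below is hypothetical, and the inputs are the rounded values from the text rather than exact measurements.

```python
def deletion_fraction(read_only_minutes, load_hours):
    """Read-only phase duration as a fraction of the data loading time."""
    return read_only_minutes / (load_hours * 60)
```

For 30 minutes and 4 hours the fraction is 0.125, i.e. 12.5% of the writing time.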


7 Related Work 


A significant number of distributed storage systems [8, 
10, 11] are designed as large scale systems which are 
distributed over wide area networks and built with un- 
trusted peers. These systems undergo frequent configu- 
ration changes. For example, the goal of OceanStore [8] 
was to provide reliable storage for all data ever cre- 
ated. These systems concentrated on scalability (e.g. 
OceanStore, PAST [11]) and tolerating a large class of 
failures, including Byzantine and large-scale correlated 
failures (Glacier [18]) at the expense of performance. 
Another group of distributed storage systems targeted 
the data center and, in this, are more like HYDRAstor. 
These systems include distributed virtual disks like
Petal [19], distributed file systems like CEPH [37] and 
Farsite [6], clustered file systems like Sorrento [34], 
Panasas [39], and GoogleFS [15], clustered storage in- 
cluding Ursa Minor [5], RADOS [38], and FAB [28]. 
Compared to HYDRAstor these systems have different 
target applications and are not advertised as secondary 
storage. As a result, they do not provide deduplication
(except Farsite, which does it on file level); these systems
are not CAS-based, but need to deal with issues of
consistency in the presence of write-sharing, which do 
not occur in our system. Ursa Minor does support user- 
selected choices of data resiliency, similar to our data 
resiliency classes. DISP [14] is a flexible system that 
can be specialized to both WAN and data center. Like 
HYDRAstor, DISP uses erasure codes, but it does not 
provide deduplication. 

Venti [24], EMC Centera [4, 16], Pergamum [32] and 
DataDomain [43] are secondary storage systems. Venti, 
Pergamum and Centera target archiving, whereas Data- 
Domain is designed to store backup data. Pergamum 
does not support duplicate elimination; the Venti prototype
and Centera do it, respectively, on fixed block size and
entire file level. Centera might be able to do chunk-level
deduplication, based on available information [16],
but the chunk size seems to be much larger (100 MB
per chunk versus the HYDRAstor 64 KB). These approaches
result in lower deduplication than the variable-block-size
approach used by HYDRAstor and DataDomain.
However, DataDomain is a centralized system and
does not do global deduplication in a distributed environment.
HYDRAstor provides global deduplication, also
using variable block chunking, with comparable write performance.
RepStore [41], a smart-brick scalable storage
system, uses erasure codes and content-based address- 
ing, but does not provide deduplication. Deep Store [40], 
an archiving system, employs a multitude of techniques
for reducing stored data size, including delta compres- 
sion and variable-block-size deduplication. However, 
this system does not target backup data. 

Blocks in our system have some resemblance to objects
in object-based storage [22], as they have attributes
(for example resiliency class) and a simple interface
to access their components, such as pointers.

Many systems introduce structures similar to SCCs for 
block aggregation. Venti uses arenas to serve as a unit 
of data maintenance; however, they do not take advan- 
tage of the sequential nature of incoming data streams 
and achieve very low performance. The Foundation [26] 
CAS Layer improves Venti’s sequential write perfor- 
mance by prefetching entire arenas when duplicates are 
written. However, since Foundation is designed for per- 
sonal use, it does not have to deal with the problem of 
multiple streams written concurrently but later read sep- 
arately. DataDomain introduces containers to group se- 
quential writes from each stream of data to increase ef- 
fectiveness of read-ahead caching. HYDRAstor achieves 
a similar result by sorting incoming blocks by their 
stream id and flushing them out to disk in batches. Using 
separate containers for every stream in HYDRAstor is 
not feasible, as the number of containers written concur- 
rently may be very large for big systems. HYDRAstor 
data organization is unique in its use of replicated chains of
containers, which allow for reasoning about the state of the
data in the system.
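
The stream-sorting idea described above can be sketched in a few lines. This is an illustrative sketch only, not HYDRAstor's implementation: the `Block` record, the `flush_batch` function, and the list standing in for the on-disk layout are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical block record: which stream wrote it, and its payload.
Block = namedtuple("Block", ["stream_id", "data"])

def flush_batch(pending):
    """Return the blocks in the order they would be written to disk.

    sorted() is stable, so blocks of one stream stay in arrival order
    while becoming contiguous; a later sequential read of that stream
    then touches neighbouring locations.
    """
    return sorted(pending, key=lambda b: b.stream_id)

# Three backup streams interleaved on arrival.
arrivals = [Block(2, "b0"), Block(1, "a0"), Block(2, "b1"),
            Block(3, "c0"), Block(1, "a1")]
on_disk = flush_batch(arrivals)
print([(b.stream_id, b.data) for b in on_disk])
```

Batching bounds the amount of state held at once: only one batch is sorted at a time, rather than one open container being kept per stream.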

Deletion in a distributed storage system is relatively 
simple if there is no duplicate elimination. It can be done 
with leases like in Glacier [18], or with simple recla- 
mation of obsolete versions like in Ursa Minor. How- 
ever, with deduplication, deletion becomes difficult for 
reasons explained earlier. For example, Venti and Deep 
Store have not implemented deletion. As far as we know, 
the HYDRAstor back-end approach to deletion is unique. 
The use of blocks with pointers, retention and deletion
roots, and redundant chains of containers enables an
efficient, fault-tolerant implementation of distributed
deletion.


8 Conclusions and Future Work 


HYDRAstor is a decentralized, scalable secondary storage
system that is commercially available today. It can be used
as an on-line repository for all enterprise backup and 
archival data while dynamically and efficiently sharing 
available capacity. Critical features like high-availability 
and reliability, ease of management, capacity and perfor- 
mance scalability, and storage efficiency make the sys- 
tem unique in addressing today’s enterprise needs. The 
system is externally visible as one storage pool and can 
be accessed by legacy applications using a traditional file
system interface.

The core architecture is built around a DHT with vir- 
tual supernodes spanned over physical nodes. Data re- 
siliency is provided with erasure codes, with fragments 
of erasure-coded blocks distributed among supernode 
components. Redundancy in the network and data allows 
for on-line upgrades and extensions, increasing availabil- 
ity of the system. High storage efficiency is facilitated by 
variable block size global deduplication. The back-end 
exports a low-level API providing operations on content- 
addressed blocks which expose pointers to other blocks. 
A novel data organization based on redundant chains of 
data containers is used to reliably deliver a multitude of
data services, including failure-tolerant deletion and fast 
verification of data health. 

Although the system is fully functional today, important
work remains to improve the value it delivers to the end
user. The read-only phase of deletion will be
eliminated, which will make the system fully usable all 
the time. Deduplication can be moved to a proxy server, 
saving bandwidth and improving write performance of 
highly-duplicated streams. Additionally, since multiple 
types of drivers can write to the back-end, there is a need 
for a stream interface that can cut data into blocks in 
a standard way. This will ensure higher deduplication 
among data written by different types of clients. 
Abstract


The Smoke and Mirrors File System (SMFS) mirrors 
files at geographically remote datacenter locations with 
negligible impact on file system performance at the pri- 
mary site, and minimal degradation as a function of 
link latency. It accomplishes this goal using wide-area 
links that run at extremely high speeds, but have long 
round-trip-time latencies—a combination of properties 
that poses problems for traditional mirroring solutions. 
In addition to its raw speed, SMFS maintains good syn- 
chronization: should the primary site become completely 
unavailable, the system minimizes loss of work, even for 
applications that simultaneously update groups of files. 
We present the SMFS design, then evaluate the system 
on Emulab and the Cornell National Lambda Rail (NLR) 
Ring testbed. Intended applications include wide-area 
file sharing and remote backup for disaster recovery. 


1 Introduction 


Securing data from large-scale disasters is important, es- 
pecially for critical enterprises such as major banks, bro- 
kerages, and other service providers. Data loss can be 
catastrophic for any company — Gartner estimates that 
40% of enterprises that experience a disaster (e.g. loss 
of a site) go out of business within five years [41]. Data 
loss at a large bank can have much greater consequences,
with potentially global implications.

Accordingly, many organizations are looking at dedi- 
cated high-speed optical links as a disaster tolerance op- 
tion: they hope to continuously mirror vital data at re- 
mote locations, ensuring safety from geographically lo- 
calized failures such as those caused by natural disas- 
ters or other calamities. However, taking advantage of 
this new capability in the wide-area has been a chal- 
lenge; existing mirroring solutions are highly latency 
sensitive [19]. As a result, many critical enterprises op- 
erate at risk of catastrophic data loss [22]. 

The central trade-off involves balancing safety against
performance. So-called synchronous mirroring solutions
[6, 12] block applications until data is safely mirrored
at the remote location: the primary site waits for
an acknowledgment from the remote site before allow- 
ing the application to continue executing. These are 
very safe, but extremely sensitive to link latency. Semi- 
synchronous mirroring solutions [12, 42] allow the ap- 
plication to continue executing once data has been writ- 
ten to a local disk; the updates are transmitted as soon 
as possible, but data can still be lost if disaster strikes. 
The far end of the spectrum is fully asynchronous: not only
does the application resume as soon as the data is writ- 
ten locally, but updates are also batched and may be 
transmitted periodically, for instance every thirty min- 
utes [6, 12, 19, 31]. These solutions perform best, but 
have the weakest safety guarantees. 


Today, most enterprises primarily use asynchronous or 
semi-synchronous remote mirroring solutions over the 
wide-area, despite the significant risks posed by such 
a stance. Their applications simply cannot tolerate the 
performance degradation of synchronous solutions [22]. 
The US Treasury Department and the Finance Sector 
Technology Consortium have identified the creation of 
new options as a top priority for the community [30]. 


In this paper, we explore a new mirroring option called 
network-sync, which potentially offers stronger guaran- 
tees on data reliability than semi-synchronous and asyn- 
chronous solutions while retaining their performance. It 
is designed around two principles. First, it proactively 
adds redundancy at the network level to transmitted data. 
Second, it exposes the level of in-network redundancy 
added for any sent data via feedback notifications. Proac- 
tive redundancy allows for reliable transmission with la- 
tency and jitter independent of the length of the link, a 
property critical for long-distance mirroring. Feedback 
makes it possible for a file system (or other applications) 
to respond to clients as soon as enough recovery data has 
been transmitted to ensure that the desired safety level 
has been reached. Figure 1 illustrates this idea.
Figure 1: Remote Mirroring Options. (1) Synchronous mirroring provides a remote-sync guarantee: data is not lost
in the event of disaster, but performance is extremely sensitive to the distance between sites. (2) Asynchronous and
semi-synchronous mirroring give a local-sync guarantee: performance is independent of distance between mirrors,
but can suffer significant data loss when disaster strikes. (3) A new network-sync mirroring option with performance
similar to local-sync protocols, but with improved reliability.


Of course, data can still be lost; network-sync is not 
as safe as a synchronous solution. If the primary site 
fails and the wide-area network simultaneously parti- 
tions, data will still be lost. Such scenarios are un- 
common, however. Network-sync offers the developer 
a valuable new option for trading data reliability against 
performance. 

Although this paper focuses on the Smoke and Mir- 
rors File System (SMFS), we believe that many kinds of 
applications could benefit from a network-sync option. 
These include other kinds of storage systems where re- 
mote mirroring is performed by a disk array (e.g. [12]), a 
storage area network (e.g. [19]), or a more traditional file 
server (e.g. [31]). Network-sync might also be valuable 
in transactional databases that stream update logs from 
a primary site to a backup, or to other kinds of fault- 
tolerant services. 

Beyond its use of the network-sync option, SMFS has 
a second interesting property. Many applications update 
files in groups, and in such cases, if even one of the files 
in a group is out of date, the whole group may be useless 
(Seneca [19] calls this atomic, in-order asynchronous 
batched commits; SnapMirror [31] offers a similar ca- 
pability). SMFS addresses the need in two ways. First, 
if an application updates multiple files in a short period 
of time, the updates will reach the remote site with min- 
imal temporal skew. Second, SMFS maintains group- 
mirroring consistency, in which files in the same file sys- 
tem can be updated as a group in a single operation where 
the group of updates will all be reflected by the remote 
mirror site atomically, either all or none. 

In summary, our paper makes the following contribu- 
tions: 


• We propose a new remote mirroring option called
network-sync in which error-correction packets are
proactively transmitted, and link-state is exposed
through a callback interface.


• We describe the implementation and evaluation
of SMFS, a new mirroring file system that supports
both capabilities, using an emulated wide-area
network (Emulab [40]) and the Cornell National
Lambda Rail (NLR) Ring testbed [1]. This evaluation
shows that SMFS:


— Can be tuned to lose little or no data in the 
event of a rolling disaster. 


— Supports high update throughput, masking 
wide-area latency between the primary site 
and the mirror. 


— Minimizes jitter when files are updated in 
short periods of time. 


• We show that SMFS has good group-update performance
and suggest that this represents a benefit
of using a log-structured file architecture in remote
mirroring.


The rest of this paper is structured as follows. We dis- 
cuss our fault model in Section 2. In Section 3, we de- 
scribe the network-sync option. We describe the SMFS 
protocols that interact with the network-sync option in 
Section 4. In Section 5, we evaluate the design and im- 
plementation. Finally, Section 6 describes related work 
and Section 7 concludes. 


2 What’s the Worst that Could 
Happen? 


We argue that our work responds to a serious impera- 
tive confronted by the financial community (as well as by 
other critical infrastructure providers). As noted above, 
today many enterprises opt to use asynchronous or semi- 
synchronous remote mirroring solutions despite the risks 
they pose, because synchronous solutions are perceived 
as prohibitively expensive in terms of performance [22]. 
In effect, these enterprises have concluded that there simply
is no way to maintain a backup at geographically remote
distances at the update rates seen within their datacenters.
Faced with this apparent impossibility, they literally
risk disaster.

Figure 2: Example Failure Events. A single failure event may not result in loss of data. However, multiple
nearly-simultaneous failure events (i.e. a rolling disaster) may result in data loss for asynchronous and
semi-synchronous remote mirroring.

It is not feasible to simply legislate a solution, be- 
cause today’s technical options are inadequate. Finan- 
cial systems are under huge competitive pressure to sup- 
port enormous transaction rates, and as the clearing time 
for transactions continues to diminish towards immedi- 
ate settlement, the amounts of money at risk from even 
a small loss of data will continue to rise [20]. Asking 
a bank to operate in slow-motion so as to continuously 
and synchronously maintain a remote mirrored backup 
is just not practical: the institution would fail for reasons 
of non-competitiveness. 

Our work cannot completely eliminate this problem: 
for the largest transactions, synchronous mirroring (or 
some other means of guaranteeing that data will survive 
any possible outage) will remain necessary. Nonetheless, 
we believe that there may be a very large class of applications
with intermediate data stability needs. If we
can reduce the window of vulnerability significantly, our 
hypothesis is that even in a true disaster that takes the 
primary site offline and simultaneously disrupts the net- 
work, the challenges of restarting using the backup will 
be reduced. Institutions betting on network-sync would 
still be making a bet, but we believe the bet is a much 
less extreme one, and much easier to justify. 


Failure Model and Assumptions: We assume that 
failures can occur at any level — including storage de- 
vices, storage area network, network links, switches, 
hubs, wide-area network, and/or an entire site. Further, 
we assume that they can fail simultaneously or even in 
sequence: a rolling disaster. However, we assume that 
the storage system at each site is capable of tolerating 
and recovering from all but the most extreme local failures.
Also, sites may have redundant network paths connecting
them. This allows us to focus on the tolerance of
failures that disable an entire site, and on combinations of 
failures such as the loss of both an entire site and the net- 
work connecting it to the backup (what we call a rolling 
disaster). Figure 2 illustrates some points of failure. 

With respect to wide-area optical links, we assume that 
even though industry standards essentially preclude data 
loss on the links themselves, wide-area connections in- 
clude layers of electronics: routers, gateways, firewalls, 
etc. These components can and do drop packets, and at 
very high data rates, so can the operating system on the 
destination machine to which data is being sent. Accord- 
ingly, our model assumes wide-area networks with high 
data rates (10 to 40 Gbit/s) but sporadic, potentially
bursty, packet loss. The packet loss model used in our
experiments is based on actual observations of TeraGrid, a
scientific data network that links scientific supercomputing
centers and has precisely these characteristics. In partic- 
ular, Balakrishnan et al. [10] cite loss rates over 0.1% at 
times on uncongested optical-link paths between super- 
computing centers. As a result, we emulate disaster with 
up to 1% loss rates in our evaluation of Section 5. 

Of course, reliable transmission protocols such as TCP 
are typically used to communicate updates and acknowledgments
between sites. Nonetheless, under our assumptions,
a lost packet may prevent later received packets
from being delivered to the mirrored storage system. The 
problem is that once the primary site has failed, there 
may be no way to recover a lost packet, and because 
TCP is sequenced, all data sent after the lost packet will 
be discarded in such situations — the gap prevents their 
delivery. 


Data Loss Model: We consider data to be lost if an
update has been acknowledged to the client, but the cor- 
responding data no longer exists in the system. Today’s 
remote mirroring regimes all experience data loss, but 
the degree of disaster needed to trigger loss varies:


• Synchronous mirroring sends acknowledgments to
the client only after receiving a response from
the mirror. Data cannot be lost unless both primary
and mirror sites fail.


• Semi-synchronous mirroring sends acknowledgments
to the client after written data is stored locally
at the primary site and an update is sent to the mirror.
This scheme does not lose data unless the primary
site fails and sent packets do not make it to
the mirror. For example, packets may be lost while
resident in local buffers before being sent on
the wire, the network may experience packet loss
or partition, or components may fail at the mirror.


• Asynchronous mirroring sends acknowledgments to
the client immediately after data is written locally.
Data loss can occur even if just the primary site
fails. Many products form snapshots periodically,
for example, every twenty minutes [19, 31]. Twenty
minutes of data could thus be lost if a failure disrupts
snapshot transmission.


Goals: Our work can be understood as an enhancement 
of the semi-synchronous style of mirroring. The basic 
idea is to ensure that once a packet has been sent, the 
likelihood that it will be lost is as low as possible. We 
do this by sending error recovery data along with the 
packet and informing the sending application when error 
recovery has been sent. Further, by exposing link state, 
an error correcting coding scheme can be tuned to better 
match the characteristics observed in existing high-speed 
wide-area networks. 


3 Network-Sync Remote Mirroring 


Network-sync strikes a balance between performance
and reliability, offering performance similar to that of
semi-synchronous solutions, but with increased reliability.
We use a forward-error correction protocol to increase the
reliability of high-quality optical links. For example, a link
that drops one out of every 1 trillion bits or 125 million
1 KB packets (this is the maximum error threshold beyond
which current carrier-grade optical equipment shuts
down) can be pushed into losing less than 1 out of every
10^16 packets by the simple expedient of sending each
packet twice, a figure that begins to approach disk
reliability levels [7, 15]. By adding a callback when error
recovery data has been sent, we can permit the application
to resume execution once these encoded packets are
sent, in effect treating the wide-area link as a kind of
network disk. In this case, data is temporarily “stored” in
the network while being shipped across the wide-area to
the remote mirror. Figure 1 illustrates this capability.
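
The loss figures quoted above can be checked with a short calculation. The sketch below rests on assumptions that the text does not state explicitly: a "1 KB" packet is taken to be 1000 bytes, a packet counts as lost if any one of its bits is dropped, and the two copies of a packet are lost independently.

```python
from fractions import Fraction

BITS_PER_PACKET = 8 * 1000          # assumption: "1 KB" means 1000 bytes
bit_loss = Fraction(1, 10**12)      # one dropped bit per trillion bits

# Union bound over the bits of a packet; for such a small bit-loss
# rate this is an accurate estimate of the packet-loss rate.
packet_loss = BITS_PER_PACKET * bit_loss
packets_per_loss = 1 / packet_loss  # one lost packet per this many sent

# With every packet sent twice, data is lost only if both copies are
# lost (assumed independent).
both_lost = packet_loss ** 2

print(int(packets_per_loss))              # packets sent per single loss
print(float(both_lost))                   # loss rate with duplication
print(both_lost < Fraction(1, 10**16))    # below the 1-in-10^16 bound?
```

Under these assumptions the single-copy rate works out to one loss per 125 million packets, and the duplicated rate falls below one in 10^16. Bursty loss would violate the independence assumption, which is one reason a more elaborate encoding than plain duplication is of interest.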
One can imagine many ways of implementing this behavior
(e.g. datacenter gateway routers). In general,
implementations of network-sync remote mirroring must 
satisfy two requirements. First, they should proactively 
enhance the reliability of the network, sending recovery 
data without waiting for any form of negative acknowledgment
(e.g. TCP fast retransmit) or timeouts keyed
to the round-trip-time (RTT) to the remote site. Second, 
they must expose the status of outgoing data, so that the 
sender can resume activity as soon as a desired level of 
in-flight redundancy has been achieved for pending up- 
dates. Section 3.1 discusses the network-sync option, 
Section 3.2 discusses an implementation of it, and Sec- 
tion 3.3 discusses its tolerance to disaster. 


3.1 Network-Sync Option 


Assuming that an external client interacts with a primary 
site and the primary site implements some higher level 
remote mirroring protocol, network-sync enhances that 
remote mirroring protocol as follows. First, a host lo- 
cated at the primary site submits a write request to a lo- 
cal storage system such as a disk array (e.g. [12]), stor- 
age area network (e.g. [19]), or file server (e.g. [31]). 
The local storage system simultaneously applies the re- 
quested operation to its local storage image and uses a 
reliable transport protocol such as TCP to forward the 
request to a storage system located at the remote mirror. 
To implement the network-sync option, an egress router 
located at the primary site forwards the IP packets asso- 
ciated with the request, sends additional error correcting 
packets to an ingress router located at the remote site, 
and then performs a callback, notifying the local storage 
system which of the pending updates are now safely in 
transit¹. The local storage system then replies to the
requesting host, which can advance to any subsequent
dependent operations. We assume that ingress and egress
routers are under the control of site operators, thus can 
be modified to implement network-sync functionality. 

Perhaps 50ms or so may then elapse before the
remote mirror storage system receives the mirrored 
request—possibly after the network-sync layer has re- 
constructed one or more lost packets using the combina- 
tion of data received and error-recovery packets received. 
It applies the request to its local storage image, generates 
a storage level acknowledgment, and sends a response. 
Finally, when the primary storage system receives the re- 
sponse, perhaps 100ms later, it knows with certainty that 
the request has been mirrored and can garbage collect 
any remaining state (e.g. [19]). Notice that if a client re- 
quires the stronger assurances of a true remote-sync, the 
possibility exists of offering that guarantee selectively, 
on a per-operation basis. Figure 3 illustrates the
network-sync mirroring option and Table 1 contrasts it
with existing solutions.
Figure 3: Network-Sync Remote Mirroring Option. (1) A primary-site storage system simultaneously applies a request
locally and forwards it to the remote mirror. After the network-sync layer (2) routes the request and sends additional
error correcting packets, it (3) sends an acknowledgment to the local storage system; at this point, the storage system
and application can safely move to the next operation. Later, (4) a remote mirror storage system receives the mirrored
request, possibly after the network-sync layer recovered some lost packets. It applies the request to its local storage
image, generates a storage level acknowledgment, and (5) sends a response. Finally, (7) when the primary storage
system receives the response, it knows with certainty that the request has been mirrored and can garbage collect.
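
The difference between the mirroring options comes down to which event the primary storage system waits for before replying to the host. The following sketch makes that explicit. It is a simplified model, not part of SMFS: the function name, the mode labels, and all timing values are assumptions chosen for illustration, and the processing time at the mirror is ignored.

```python
def ack_delay_ms(mode, local_write_ms, fec_send_ms, one_way_ms):
    """Delay before the primary storage system can reply to the host.

    The local write and the wide-area send proceed in parallel, so each
    mode waits for the slower of the events it depends on.
    """
    if mode == "local-sync":      # async or semi-sync: local write only
        return local_write_ms
    if mode == "network-sync":    # local write + callback that FEC was sent
        return max(local_write_ms, fec_send_ms)
    if mode == "remote-sync":     # local write + mirror's response (one RTT)
        return max(local_write_ms, 2 * one_way_ms)
    raise ValueError(f"unknown mode: {mode}")

# Illustrative numbers only: 1 ms local write, 2 ms to emit the data
# and repair packets, 50 ms one-way wide-area latency.
delays = {m: ack_delay_ms(m, 1, 2, 50)
          for m in ("local-sync", "network-sync", "remote-sync")}
print(delays)
```

In this model the network-sync delay does not depend on `one_way_ms` at all, which is the property that lets the mirror be placed far away without slowing the application.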


Table 1: Comparison of Mirroring Protocols. Columns: Mirror Solution; Mirror Update; Mirror-ack Receive;
Mirror-ack Latency; Rolling Disaster (Local-only Failure, Local + Pckt Loss, Local + NW Partition,
Local + Mirror Failure).





3.2 Maelstrom: Network-sync Implementation


The network-sync implementation used in our work is 
based on Forward Error Correction (FEC). FEC is a 
generic term for a broad collection of techniques aimed 
at proactively recovering from packet loss or corruption. 
FEC implementations for data generated in real-time are 
typically parameterized by a rate (r,c): for every r data 
packets, c error correction packets are introduced into the 
stream. Of importance here is the fact that FEC perfor- 
mance is independent of link length (except to the extent 
that loss rates may be length-dependent). 
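
To make the (r, c) parameterization concrete, here is a minimal sketch of the simplest possible case, c = 1, where the single repair packet is the XOR of the r data packets in its group. This is a textbook parity code used purely for illustration and is not the encoding used in our system; the function names are hypothetical, and all packets are assumed to have equal length.

```python
from functools import reduce

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(packets, r):
    """Return the stream with one XOR repair packet after every r data
    packets, i.e. rate (r, 1)."""
    out = []
    for i in range(0, len(packets), r):
        group = packets[i:i + r]
        out.extend(("data", p) for p in group)
        out.append(("fec", reduce(xor_bytes, group)))
    return out

def recover(group_received, parity):
    """Rebuild the single missing packet (marked None) in one group."""
    present = [p for p in group_received if p is not None]
    missing = reduce(xor_bytes, present, parity)
    return [p if p is not None else missing for p in group_received]

data = [b"AAAA", b"BBBB", b"CCCC"]
stream = encode(data, r=3)
parity = stream[-1][1]
print(recover([b"AAAA", None, b"CCCC"], parity))
```

The receiver repairs the loss from packets it already holds, without a round trip to the sender, which is why recovery time does not grow with link length. A single parity packet cannot repair two losses in the same group, so bursty loss calls for a stronger encoding.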


The specific FEC protocol we worked with is called 
Maelstrom [10], and is designed to match the observed 
loss properties of multi-hop wide-area networks such 
as TeraGrid. Maelstrom is a symmetric network appli- 
ance that resides between the datacenter and the wide- 
area link, much like a NAT box. The solution is com- 
pletely transparent to applications using it, and employs 
a mixture of technologies: routing tricks to conceal itself 
from the endpoints, a link-layer reliability protocol (cur- 
rently TCP), and a novel FEC encoding called layered 
interleaving, designed for data transfer over long-haul 
links with potentially bursty loss patterns. To minimize 
the rate-sensitivity of traditional FEC solutions, Maelstrom
aggregates all data flowing between the primary
and backup sites and operates on the resulting high-speed
stream. See Balakrishnan et al. [10] for a detailed de- 
scription of layered interleaving and analysis of its per- 
formance tolerance to random and bursty loss. 


Maelstrom also adds feedback notification callbacks. 
Every time Maelstrom transmits a FEC packet, it also is- 
sues a callback. The local storage system then employs 
a redundancy model to infer the level of safety associ- 
ated with in-flight data packets. For this purpose, a local 
storage system needs to know the underlying network’s 
properties — loss rate, burst length, etc. It uses these to 
model the behavior of Maelstrom mathematically [10], 
and then makes worst-case assumptions about network 
loss to arrive at the needed estimate of the risk of data 
loss. We expect system operators to monitor network
behavior and periodically adjust Maelstrom parameters to
adapt to any changes in the network characteristics.
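
The way a storage system might turn callbacks into a safety decision can be sketched as follows. This is a deliberately crude stand-in for the redundancy model, not the model from [10]: the `SafetyTracker` class, its method names, and the independence assumption are all hypothetical and are used only to show the control flow.

```python
class SafetyTracker:
    """Hypothetical redundancy model driven by FEC-sent callbacks."""

    def __init__(self, worst_case_loss, target_risk):
        self.p = worst_case_loss      # assumed worst-case packet loss rate
        self.target = target_risk     # acceptable risk of losing an update
        self.repairs = {}             # update id -> repair packets sent

    def on_data_sent(self, update_id):
        self.repairs[update_id] = 0

    def on_fec_sent(self, update_id):
        """Callback: one more repair packet covering update_id is in flight.

        Returns True once the update may be acknowledged to the client.
        """
        self.repairs[update_id] += 1
        return self.risk(update_id) <= self.target

    def risk(self, update_id):
        # Crude independence assumption: the update is unrecoverable only
        # if the data packet and every repair packet covering it are lost.
        return self.p ** (1 + self.repairs[update_id])

tracker = SafetyTracker(worst_case_loss=0.01, target_risk=1e-5)
tracker.on_data_sent("u1")
results = [tracker.on_fec_sent("u1"), tracker.on_fec_sent("u1")]
print(results)
```

With a 1% worst-case loss rate and a target risk of 1e-5, the first callback is not enough and the second one is. A model that accounts for burst length would be more conservative than this one.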


There are cases in which the Maelstrom FEC protocol 
is unable to repair the loss (this can only occur if several 
packets are lost, and in specific patterns that prevent us 
from using FEC packets for recovery). To address such 
loss patterns, we run our mirroring solution over TCP, 
which in turn runs over Maelstrom: if Maelstrom fails to 
recover a lost packet, the end-to-end TCP protocol will 
recover it from the sender. 
3.3 Discussion


The key metric for any disaster-tolerant remote mirror- 
ing technology is the distance by which datacenters can 
be separated. Today, a disturbing number of New York 
City banks maintain backups in New Jersey or Brooklyn, 
because they simply cannot tolerate higher latencies. 

The underlying problem is that these systems typically 
operate over TCP/IP. Obviously, the operators tune the 
system to match the properties of the network. For exam- 
ple, TCP can be configured to use massive sender buffers 
and unusually large segments; also, an application can be 
modified to employ multiple side-by-side streams (e.g. 
GridFTP). Yet even with such steps, the protocol remains 
purely reactive: recovery packets are sent only in
response to actual indications of failure, in the form of
negative acknowledgments (i.e. fast retransmit) or time- 
outs keyed to the round-trip-time (RTT). Consequently, 
their recovery time is tightly linked to the distance be- 
tween communicating end-points. TCP/IP, for example, 
requires a minimum of around 1.5 RTTs to recover lost 
data, which translates into substantial fractions of a sec- 
ond if the mirrors are on different continents. No mat- 
ter how large we make the TCP buffers, the remote data 
stream will experience an RTT hiccup each time loss oc- 
curs: to deliver data in order, the receiver must await the 
missing data before subsequent packets can be delivered. 
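
A quick calculation shows how this lower bound scales with distance. The 1.5 RTT factor is the one quoted above; the three round-trip times are assumed values chosen only to represent nearby, cross-country, and intercontinental mirrors.

```python
TCP_RECOVERY_RTTS = 1.5   # lower bound quoted in the text

def tcp_recovery_ms(rtt_ms):
    """Minimum stall, in milliseconds, while TCP recovers one lost packet."""
    return TCP_RECOVERY_RTTS * rtt_ms

# Assumed round-trip times, for illustration only.
for label, rtt in [("metro", 2), ("cross-country", 100),
                   ("intercontinental", 300)]:
    print(f"{label:>16}: RTT {rtt:>3} ms -> stall of at least "
          f"{tcp_recovery_ms(rtt):.0f} ms")
```

Because delivery is in order, every packet queued behind the lost one waits for the same interval, regardless of how large the buffers are.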

Network-sync evades this RTT issue, but does not pro- 
tect the application against every possible rolling disaster 
scenario. Packets can still be queued in the local-area 
when disaster strikes. Further, the network can partition
in the split second(s) before a primary site fails.
Neither proactive redundancy nor network-level callbacks
will prevent loss in these cases. Accordingly, we en- 
vision that applications will need a mixture of remote- 
sync and network-sync, with the former reserved for par- 
ticularly sensitive scenarios, and the latter used in most 
cases. 

Another issue is failover and recovery. Since the 
network-sync option enhances remote mirroring proto- 
cols, we assume that a complete remote mirroring proto- 
col will itself handle failover and recovery directly [19, 
22, 20]. As a result, in this work, we focus on evaluating 
the fault tolerant capabilities of a network-sync option 
and do not discuss failover and recovery protocols. 


4 Mirroring Consistency via SMFS 


Figure 4: Format of a log after writing a file system
with two subdirectories /dir1/file1 and
/dir2/file2.


We will say that a mirror image is inconsistent if
out-of-order updates are applied to the mirror, or if the
application updates a group of files and a period ensues during
which some of the mirrored copies reflect the updates but
others are stale. Inconsistency is a well-known problem
when using networks to access file systems, and the issue
can be exacerbated when mirroring. For example,
suppose that one were to mirror an NFS server, using the
standard but unreliable UDP transport protocol. Primary 
and remote file systems can easily become inconsistent, 
since UDP packets can be reordered on the wire, par- 
ticularly if a packet is dropped and the NFS protocol is 
forced to resend it. Even if a reliable transport protocol is 
used, in cases where the file system is spread over multi- 
ple storage servers, or applications update groups of files, 
skew in update behavior between the different mirrored 
servers may be perceived as an inconsistency by applica- 
tions. 

To address this issue, SMFS implements a file sys- 
tem that preserves the order of operations in the struc- 
ture of the file system itself, a distributed log-structured 
file system (distributed-LFS)*, where a particular log is 
distributed over multiple disks. Similar to LFS [35, 27], 
it embeds a UNIX tree-structured file system into an
append-only log format (Figure 4). It breaks a particular log
into multiple segments that each have a finite maximum 
size and are the units of storage allocation and cleaning. 

Although log-structured file systems may be unpopu- 
lar in general settings (due to worries about high cleaning 
costs if the file system fills up), a log structure turns out to 
be nearly ideal for file mirroring. First, it is well known 
that an append-only log-structure is optimized for write 
performance [27, 35]. Second, by combining data and 
order of operations into one structure — the log — iden- 
tical structures can be managed naturally at remote loca- 
tions. Finally, log operations can be pipelined, increasing 
system throughput. Of course, none of this eliminates 
worries about segment cleaning costs. Our assumption is 
that because SMFS would be used only for files that need 
to be mirrored, such as backups and checkpoints, it can 
be configured with ample capacity—far from the tipping 
point at which these overheads become problematic. 

In Sections 4.1 and 4.2, we describe the storage
system's architecture and API.


4.1 SMFS Architecture


Figure 5: File System Architecture: Applications communicate
with the file server through (possibly) an NFS
interface. The file server in turn communicates with the
metadata server through the create() function call.
The metadata server allocates space for the newly created
log on storage servers that it selects. The file server then
interacts directly with the storage server for append(),
read(), and free() operations.


The SMFS architecture is illustrated in Figure 5. It works
as follows. Clients access file system data by communicating
with a file server (e.g. using the NFS protocol).
File servers handle writes in a similar fashion to LFS. 
The log is updated by traversing a file system in a depth- 
first manner, first appending modified data blocks to the 
log, storing the log address in an inode, then appending 
the modified inode to the log, and so on [27, 29]. Reads 
are handled as in any conventional file system; starting 
with the root inode (stored in memory or a known loca- 
tion on disk) pointers are traversed to the desired file in- 
ode and data blocks. Although file servers provide a file 
system abstraction to clients, they are merely hosts in the 
storage system and stable storage resides with separate 
storage servers. 


4.2 SMFS API 


File servers interact with storage servers through a thin
log interface: create(), append(), read(), and
free(). create() communicates with a metadata
server to allocate storage resources for a new log; it assigns
responsibility for the new log to a storage server.
After a log has been created, a file server uses the
append() operation to add data to the log. The file
server communicates directly with a log’s storage server
to append data. The storage server assigns the order of
each append (that is, the address in the log for that
append) and atomically commits the operation.
SMFS maintains group-mirroring consistency, in which
a single append() can contain updates to many different
files, and the group of updates is reflected by the
storage system atomically, either all or
none. read() returns the data associated with a log
address. Finally, free() takes a log address and marks
the address for later cleaning. In particular, after a block
has been modified or a file removed, the file system calls
free() on all blocks that are no longer referenced. The
create(), append(), and free() operations are
mirrored between the primary site and remote mirror.
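As a rough illustration of the log interface described above, the following Python sketch models one storage server in memory. The class names, the 1 MiB segment limit, and the in-memory representation are assumptions made for illustration only; the SMFS prototype itself is written in Java, and mirroring to the remote site is not modeled here.

```python
from dataclasses import dataclass, field

SEGMENT_MAX_BYTES = 1 << 20  # hypothetical segment size limit (1 MiB)


@dataclass(frozen=True)
class LogAddress:
    """Log address: a segment identifier plus an offset into that segment."""
    segment_id: int
    offset: int


@dataclass
class StorageServer:
    """Toy in-memory storage server responsible for one log."""
    segments: dict = field(default_factory=lambda: {0: bytearray()})
    lengths: dict = field(default_factory=dict)  # LogAddress -> block length
    freed: set = field(default_factory=set)      # addresses awaiting cleaning

    def append(self, blocks):
        """Append a group of blocks all-or-nothing; return their addresses."""
        total = sum(len(block) for block in blocks)
        if total > SEGMENT_MAX_BYTES:
            raise ValueError("group does not fit in one segment")
        seg_id = max(self.segments)
        if len(self.segments[seg_id]) + total > SEGMENT_MAX_BYTES:
            seg_id += 1  # start a new segment so the group stays together
            self.segments[seg_id] = bytearray()
        segment = self.segments[seg_id]
        addresses = []
        for block in blocks:
            address = LogAddress(seg_id, len(segment))
            segment.extend(block)
            self.lengths[address] = len(block)
            addresses.append(address)
        return addresses

    def read(self, address):
        """Return the data stored at a log address."""
        end = address.offset + self.lengths[address]
        return bytes(self.segments[address.segment_id][address.offset:end])

    def free(self, address):
        """Mark an address for later cleaning; nothing is reclaimed here."""
        self.freed.add(address)


class MetadataServer:
    """Toy metadata server: create() assigns a new log to a storage server."""

    def create(self):
        return StorageServer()
```

In this sketch a file server would call MetadataServer().create() once per log, then pass modified data blocks and the updated inode together in a single append() so that the group is committed as a unit.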


5 Evaluation 


In this section, we evaluate the network-sync remote mir- 
roring option, running our SMFS prototype on Emu- 
lab [40] and the Cornell National Lambda Rail (NLR) 
Rings testbed [1]. 


5.1 Experimental Environment 


The SMFS prototype that we evaluated was implemented
as a user-mode application coded in Java.
SMFS borrows heavily from our earlier file system, An- 
tiquity [39]; however, the log address was modified to be 
a segment identifier and offset into the segment. A hash 
of the block can optionally be computed, but it is used as 
a checksum instead of as part of the block address in the 
log. We focus our evaluation on the append() operation
since that is by far the dominant operation mirrored
between two sites. 

SMFS uses the Maelstrom network appliance [10] as
the implementation of the network-sync option. Mael- 
strom can run as a user-mode module, but for the ex- 
periments reported here, it was dropped into the operat- 
ing system, where it runs as a Linux 2.6.20 kernel mod- 
ule with hooks into the kernel packet filter [2]. Packets 
destined for the opposite site are routed through a pair 
of Maelstrom appliances located at each site. More im- 
portantly, situating a network appliance at the egress and 
ingress router for each site creates a virtual link between 
the two sites, which presents many opportunities for in- 
creasing mirroring reliability and performance. 

The Maelstrom egress router captures packets, which 
it processes to create redundant packets. The original IP 
packets are forwarded unaltered; the redundant packets 
are then sent to the ingress router using a UDP channel. 
The ingress router captures and stores a window consist- 
ing of the last K IP packets that it has seen. Upon re- 
ceiving a redundant packet it checks it against the last K 
IP packets. If there is an opportunity to recover any lost 
IP packet it does so, and forwards the newly recovered 
IP packet through a raw socket to the intended destina- 
tion. Note that each appliance works in both egress and 
ingress mode since we handle duplex traffic. 
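The recovery step can be illustrated with the simplest possible code: one XOR repair packet per group of r data packets (that is, c = 1). This is a deliberately simplified stand-in and not Maelstrom's layered interleaving code; the function names and the assumption of equal-length packets are illustrative.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair(packets):
    """Egress side: fold r equal-length data packets into one repair packet."""
    repair = bytes(len(packets[0]))  # all-zero packet of the same length
    for packet in packets:
        repair = xor_bytes(repair, packet)
    return repair


def recover(received, repair, r):
    """Ingress side: rebuild the single missing packet of a group, if any.

    `received` maps a packet's index within the group (0..r-1) to its bytes.
    Returns (index, packet) when exactly one packet is missing, else None.
    """
    missing = [i for i in range(r) if i not in received]
    if len(missing) != 1:
        return None  # nothing to recover, or more losses than one XOR can fix
    packet = repair
    for data in received.values():
        packet = xor_bytes(packet, data)
    return missing[0], packet
```

With a single repair packet only one loss per group can be masked, which is why practical schemes send several repair packets per group and trade goodput for tolerance to loss.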

To implement network-sync redundancy feedback, the 
Maelstrom kernel module tracks each TCP flow and 
sends an acknowledgment to the sender. Each acknowledgment
includes a byte offset from the beginning of the
stream up to the most recent byte that was included in an 
error correcting packet that was sent to the ingress router. 
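A sender could use this byte-offset feedback roughly as sketched below: it remembers where each append ends in the TCP stream and reports an append as safe once the acknowledged offset passes that point. The class and method names are hypothetical and do not come from the SMFS or Maelstrom code.

```python
class NetworkSyncTracker:
    """Tracks, for one TCP flow, which appends are covered by FEC feedback."""

    def __init__(self):
        self.sent_bytes = 0       # bytes written to the stream so far
        self.protected_bytes = 0  # highest offset covered by a repair packet
        self.pending = []         # (end_offset, append_id), in send order

    def on_append_sent(self, append_id, nbytes):
        """Record that an append of nbytes was written to the stream."""
        self.sent_bytes += nbytes
        self.pending.append((self.sent_bytes, append_id))

    def on_feedback(self, offset):
        """Handle an acknowledgment; return appends that can now be answered."""
        self.protected_bytes = max(self.protected_bytes, offset)
        done = [aid for end, aid in self.pending if end <= self.protected_bytes]
        self.pending = [(end, aid) for end, aid in self.pending
                        if end > self.protected_bytes]
        return done
```

An append is released only when the acknowledged offset covers its last byte, so a partially protected append stays pending until later feedback arrives.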

We used the TCP Reno congestion control algorithm
to communicate between mirrored storage systems for
all experiments. We experimented with other congestion
control algorithms such as CUBIC; however, the results
were nearly identical since we were measuring packets
lost after a primary site failure due to a disaster.

We tested the setup on Emulab [40]; our topology em- 
ulates two clusters of eight machines each, separated by 
a wide-area high capacity link with 50 to 200 ms RTT 
and 1 Gbps. Each machine has one 3.0 GHz Pentium
64-bit Xeon processor with 2.0 GB of memory and a 146
GB disk. Nodes are connected locally via a gigabit Ethernet
switch. We apply load to these deployments using
up to 64 testers located on the same cluster as the pri- 
mary. A single tester is an individual application that has 
only one outstanding request at a time. Figure 3 shows 
the topology of our Emulab experimental setup (with the 
difference that we used eight nodes per cluster, and not 
four). Throughout all subsequent experiments, link loss 
is random, independent and identically distributed. See 
Balakrishnan et al. [10] for an analysis with bursty link
loss. Finally, all experiments show the average and stan- 
dard deviation over five runs. 

The overall SMFS prototype is fast enough to saturate 
a gigabit wide-area link, hence our decision to work with 
a user-mode Java implementation has little bearing on 
the experiments we now report: even if SMFS were implemented
in the kernel in C, the network would still be
the bottleneck. 


5.2 Evaluation Metrics 


We identify the following three metrics to evaluate the 
efficacy of SMFS: 


• Data Loss: What happens in the event of a disaster
at the primary? For varying loss rates on the
wide-area link, how much does the mirror site diverge
from the primary? We want our system to
minimize this divergence.


• Latency: Latency can be used to measure both performance
and reliability. Application-perceived latency
measures (perceived) performance. Mirroring
latency, on the other hand, measures reliability. In
particular, the lower the latency, and the smaller the
spread of its distribution, the better the fidelity of
the mirror to the primary.


• Throughput: Throughput is a good measure of performance.
The property we desire from our system
is that throughput should degrade gracefully
with increasing link loss and latency. Also, for mirroring
solutions that use forward error correcting
(FEC) codes, there is a fundamental tradeoff between
data reliability and goodput (i.e., application-level
throughput); proactive redundancy via FEC increases
tolerance to link loss and latency, but reduces
the maximum goodput due to the overhead
of FEC codes. We focus on goodput.

Table 2: Experimental Configuration Parameters


For effective comparison, we define the following five 
configurations; all configurations use TCP to communi- 
cate between each pair of storage servers. 


• Local-sync: This is the canonical state-of-the-art
solution. It is a semi-synchronous solution. As soon
as the request has been applied to the local storage
image and the local kernel buffers a request to send
a message to the remote mirror, the local storage
server responds to the application; it does not wait
for feedback from the remote mirror, or even for the
packet to be placed on the wire.


• Remote-sync: This is the other end of the spectrum.
It is a synchronous solution. The local storage
server waits for a storage-level acknowledgment
from the remote mirror before responding to
the application.


• Network-sync: This is SMFS running with a
network-sync option, implemented by Maelstrom
in the manner outlined in Section 3 (e.g., with
TCP over FEC). The network-sync layer provides
feedback after proactively injecting redundancy
into the network. SMFS responds to the application
after receiving these redundancy notifications.


• Local-sync+FEC: As a comparison point, this
scheme is the local-sync mechanism, with Maelstrom
running on the wide-area link, but without
network-level callbacks to report when FEC packets
are placed on the wire (i.e., storage servers are
unaware of the proactive redundancy). The local
server permits the application to resume execution
as soon as data has been written to the local storage
system.


• Remote-sync+FEC: As a second comparison
point, this scheme is the remote-sync mechanism,
again using Maelstrom on the wide-area link but
without upcalls when FEC packets are sent. The
local server waits for the remote storage system to
acknowledge updates.


These five SMFS configurations are evaluated on each
of the above metrics, and their comparative performance
is presented. The Network-sync, Local-sync+FEC, and
Remote-sync+FEC configurations all use the Maelstrom
layered interleaving forward error correction codes with
parameters (r,c) = (8,3), which increases the tolerance
to network transmission errors but limits the goodput
to at most 8/11 of the maximum throughput achievable without
any proactive redundancy. Table 2 lists the configuration
parameters used in the experiments described below.

Figure 6: Data loss as a result of disaster and wide-area
link failure, varying link loss (50 ms one-way latency and
FEC params (r,c) = (8,3)).
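The goodput cost of the FEC parameters follows from simple arithmetic, sketched here in Python. The sketch assumes that every r data packets are accompanied by c repair packets of the same size, so data packets occupy r/(r+c) of the link; the function names are illustrative.

```python
from fractions import Fraction


def goodput_fraction(r, c):
    """Share of link capacity left for data with c repair packets per r data."""
    return Fraction(r, r + c)


def overhead_fraction(r, c):
    """Share of link capacity consumed by repair packets."""
    return Fraction(c, r + c)


if __name__ == "__main__":
    # Parameters used in the experiments: (r, c) = (8, 3).
    print(goodput_fraction(8, 3), overhead_fraction(8, 3))
```

For (r,c) = (8,3) this gives a goodput share of 8/11 and a repair overhead of 3/11 of the link capacity.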


5.3 Reliability During Disaster 


We measure reliability in two ways: 


• In the event of a disaster at the primary site, how
much data loss results?


• How much are the primary and mirror sites allowed
to diverge?


These questions are highly related; we distinguish be- 
tween them as follows: The maximum amount by which 
the primary and mirror sites can diverge is the extent of 
the bandwidth-delay product of the link between them; 
however, the amount of data lost in the event of fail- 
ure depends on how much of this data has been ac- 
knowledged to the application. In other words, how of- 
ten can we be caught in a lie? For instance, with a 
remote-sync solution (synchronous mirroring), though 
bandwidth-delay product — and hence primary-to-mirror 
divergence — may be high, data loss is zero. This, of 
course, is at severe cost to performance. With a local- 
sync solution (async- or semi-synchronous mirroring), 
on the other hand, data loss is equal to divergence. The 
following experiments show that the network-sync solu- 
tion with SMFS achieves a desirable mean between these 
two extremes. 
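To give a sense of scale for the divergence bound, the sketch below computes a bandwidth-delay product for the emulated link. Which delay to use (one-way or round-trip) is left as a parameter because the text does not fix it; the function names and the 4096-byte reading of the 4 kB message size used in the experiments are assumptions.

```python
def bandwidth_delay_bytes(bandwidth_bps, delay_ms):
    """Bytes that fit on the link during delay_ms at the given bit rate."""
    return bandwidth_bps * delay_ms // 8000  # bits*ms -> bytes


def messages_in_flight(bandwidth_bps, delay_ms, message_bytes=4096):
    """Whole messages that fit within the bandwidth-delay product."""
    return bandwidth_delay_bytes(bandwidth_bps, delay_ms) // message_bytes


if __name__ == "__main__":
    gbps = 1_000_000_000
    for delay_ms in (50, 100):  # one-way delay and the corresponding RTT
        print(delay_ms, bandwidth_delay_bytes(gbps, delay_ms),
              messages_in_flight(gbps, delay_ms))
```

At 1 Gbps, a 50 ms delay corresponds to 6.25 MB in flight and a 100 ms delay to 12.5 MB, which is the order of magnitude of data that a local-sync solution may have acknowledged but not yet delivered.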


Disaster Test Figure 6 shows the amount of data loss
in the event of a disaster for the local-sync, local-sync+FEC,
and network-sync solutions; we do not test
the remote-sync and remote-sync+FEC solutions in this
experiment since these solutions do not lose data.

Figure 7: Data loss as a result of disaster and wide-area
link failure, varying FEC param c (50 ms one-way latency,
1% link loss).

The rolling disaster (failure of the wide-area link and
crash of all primary site processes) occurred two minutes
into the experiment. The wide-area link operated at 0%
loss until immediately before the disaster occurred, when
the loss rate was increased for 0.5 seconds; thereafter, the link
was killed (see Section 2 for a description of rolling disasters).
The x-axis shows the wide-area link loss rate
immediately before the link is killed; link losses are ran- 
dom, independent and identically distributed. The y-axis 
shows both the total number of messages sent and total 
number of messages lost—lost messages were perceived 
as durable by the application but were not received by 
the remote mirror. Messages were of size 4 kB. 

The total number of messages sent is similar for all 
configurations since the link loss rate was 0% for most 
of the experiment. However, local-sync lost a signif- 
icant number of messages that had been reported to 
the application as durable under the policy discussed in 
Section 3.1. These unrecoverable messages were ones 
buffered in the kernel, but still in transit on the wide-area
link; when the sending datacenter crashed and the link 
(independently) dropped the original copy of the mes- 
sage, TCP recovery was unable to overcome the loss. 

Local-sync+FEC lost packets as well: it lost packets 
still buffered in the kernel, but not packets that had al- 
ready been transmitted — in the latter case, the proac- 
tive redundancy mechanism was adequate to overcome 
the loss. The best outcome is visible in the right-most 
histogram at 0.1%, 0.5%, and 1% link loss: here we 
see that although the network-sync solution experienced 
the same level of link-induced message loss, all the lost 
packets that had been reported as durable to the sender 
application were in fact recovered on the receiver side 
of the link. This supports the premise that a network-sync
solution can tolerate disaster while minimizing loss.
Combined with the results from Section 5.4, this demonstrates
that the network-sync solution actually achieves the best
balance between reliability and performance.

Figure 8: Latency distribution as a function of wide-area link loss (50 ms one-way latency).

Figure 7 quantifies the advantage of network-sync 
over local-sync+FEC. In this experiment, we run the 
same disaster scenario as above, but with 1% link loss 
during disaster and we vary the FEC parameter c (i.e. 
the number of recovery packets). At c = 0, there are no 
recovery packets for either local-sync+FEC or network- 
sync—if a data packet is lost during disaster, it cannot be 
recovered and TCP cannot deliver any subsequent data 
to the remote mirror process. Similarly, at c = 1, the 
number of lost packets is relatively high for both local- 
sync+FEC and network-sync since one recovery packet 
is not sufficient to mask 1% link loss. With c > 1, the 
number of recovery packets is often sufficient to mask 
loss on the wide-area link; however, local-sync+FEC 
loses data packets that did not transit outside the local- 
area before disaster, whereas with network-sync, primary 
storage servers respond to the client only after receiving 
a callback from the egress gateway. As a result, network- 
sync can potentially reduce data loss in a disaster. 
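The effect of c on recoverability can be approximated with a short calculation. The sketch below assumes an idealized code in which a group of r data and c repair packets is recoverable whenever at most c of its r + c packets are lost, with independent loss probability p per packet. Maelstrom's layered interleaving and the TCP dynamics of the experiment are not modeled, so the numbers are indicative only.

```python
from math import comb


def unrecoverable_probability(r, c, p):
    """Probability that more than c of the r + c packets in a group are lost."""
    n = r + c
    recoverable = sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)  # exactly k packets lost
        for k in range(c + 1)
    )
    return 1 - recoverable


if __name__ == "__main__":
    # r = 8 and 1% link loss, as in the experiment; c varies from 0 to 3.
    for c in range(4):
        print(c, unrecoverable_probability(8, c, 0.01))
```

Under these assumptions the probability of an unrecoverable group falls steeply as c grows, in line with the observation that loss on the wide-area link is largely masked once c > 1.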


Latency Figure 8 shows how latency is distributed 
across all requests for local-sync, local-sync+FEC, and 
network-sync solutions. Latency is the time between a 
local storage server sending a request and a remote stor- 
age server receiving the request. We see that these solu- 
tions show similar latency for zero link loss, but local- 
sync+FEC and network-sync show considerably better 
latency than local-sync for a lossy link. Furthermore, 
the latency spread of local-sync+FEC and network-sync 
solutions is considerably less than the spread of the local- 
sync solution — particularly as loss increases; proactive 
redundancy helps to reduce latency jitter on lossy links. 
Smaller variance in this latency distribution helps to ensure
that updates submitted as a group will arrive at the
remote site with minimal temporal skew, so that the
entire group is written rather than none of it.


5.4 Performance 


System Throughput Figure 9 compares the performance
of the five different mirroring solutions. The x-axis
shows loss probability on the wide-area link being
increased from 0% to 1%, while the y-axis shows the
throughput achieved by each of these mirroring solutions.
All mirroring solutions use 64 testers over eight
storage servers.

Figure 9: Effect of varying wide-area one-way link loss
on Aggregate Throughput.

At 0% loss we see that the local-sync and remote- 
sync solutions achieve the highest throughput because 
they do not use proactive redundancy, thus the good- 
put of the wide-area link is not reduced by the overhead 
of any forward error correcting packets. On the other 
hand, local-sync+FEC, remote-sync+FEC, and network- 
sync achieve lower throughput because the forward error 
correcting packets reduce the goodput in these cases. The 
forward error correction overhead is tunable; increasing 
FEC overhead often increases transmission reliability but 
reduces throughput. There is a slight degradation of per- 
formance for network-sync since SMFS waits for feed- 
back from the egress router instead of responding im- 
mediately after the local kernel buffers the send request. 
Finally, the remote-sync and remote-sync+FEC achieve 
comparable performance to all the other configurations 
since there is no loss on the wide-area link and the stor- 
age servers can saturate the link with overlapping mirror- 
ing requests. 

At higher loss rates (0.1%, 0.5%, and 1%), we
see that any solution that uses proactive redundancy
(local-sync+FEC, remote-sync+FEC, and network-sync)
achieves more than an order of magnitude higher
throughput than any solution that does not. This illustrates
the power of proactive redundancy, which makes it
possible for these solutions to recover from lost packets
at the remote mirror using locally-available data. Further,
we observe that these proactive redundancy solutions
perform comparably in both asynchronous and synchronous
modes: in these experiments, the wide-area
network is the bottleneck since overlapping operations
can saturate the wide-area link.

Figure 10: Effect of varying wide-area link latency on
Aggregate Throughput.

Figure 11: Effect of varying wide-area link loss on Per-Client
Throughput.

Figure 10 shows the system throughput of the 
network-sync solution as the wide-area one-way link la- 
tency increases from 25 ms to 100 ms. It demonstrates 
that the network-sync solution (or any solution that uses 
proactive redundancy) can effectively mask latency and 
loss of a wide-area link. 


Application Throughput The previous set of exper- 
iments studied system-level throughput, using a large 
number of testers. An interesting related study is pre- 
sented here, of individual-application throughput in each 
SMFS configuration. Figure 11 shows the effect of increasing
loss probability on the throughput of an application
with only one outstanding request at a time.

Figure 12: Data loss as a result of disaster and wide-area
link failure (Cornell NLR-Rings, 37 ms one-way delay).


We see now that local-sync(+FEC) and network- 
sync solutions perform better than remote-sync(+FEC). 
The reason for this difference is that with asynchrony, 
network-sync can return an acknowledgment to the ap- 
plication as soon as a request is on the wide-area link, 
providing an opportunity to pipeline requests. This is in 
contrast to conventional asynchrony, where the applica- 
tion would receive an acknowledgment as soon as a re- 
quest is buffered. The advantage with the former is that 
it provides performance gain without hurting reliability. 
The disadvantage is that pure buffering is a local sys- 
tem call operation, which can return to the application 
sooner and can achieve higher throughput as seen by the 
local-sync(+FEC) solutions. However, this increase in 
throughput is at a sacrifice of reliability; any buffered 
data may be lost in the event of a crash before it is sent 
(see Figure 6).
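The sensitivity of a single application to acknowledgment latency can be seen from a back-of-the-envelope model: with only one outstanding request, throughput is the message size divided by the time until the acknowledgment returns. The sketch below ignores transmission and processing time, and the latency values in the example are illustrative assumptions, not measurements from the experiments.

```python
def single_tester_throughput(message_bytes, ack_latency_ms):
    """Bytes per second for a client with one outstanding request at a time."""
    return message_bytes * 1000 / ack_latency_ms


if __name__ == "__main__":
    message = 4096  # 4 kB messages, read here as 4096 bytes
    # A synchronous scheme must wait at least one wide-area round trip
    # (100 ms for a 50 ms one-way delay); a scheme that answers after local
    # feedback waits far less. The 1 ms figure is purely hypothetical.
    print(single_tester_throughput(message, 100))
    print(single_tester_throughput(message, 1))
```

The model only shows the direction of the effect: the sooner a scheme can safely acknowledge a request, the more requests a single application can issue per second.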


5.5 Cornell National Lambda Rail Rings 


In addition to our emulated setup and results, we are 
beginning to physically study systems that operate on 
dedicated lambda networks that might be seen in cut- 
ting edge financial, military, and educational settings. To 
study these “personal” lambda networks, we have cre- 
ated a new testbed consisting of optical network paths of 
varying physical length that start and end at Cornell, the 
Cornell National Lambda Rail (NLR) Rings testbed. 
The Cornell NLR-Rings testbed consists of three 
rings: a short ring that goes from Cornell to New York 
City and back, a medium ring that goes to Chicago down 
to Atlanta and back, and a long ring that goes to Seattle
down to Los Angeles and back. The one-way latency
is 7.9 ms, 37 ms, and 94 ms, for the short, medium, and 
long rings, respectively. The underlying optical network- 
ing technology is state-of-the-art: a 10 Gbps wide-area 
network running on dedicated fiber optics (separate from 
the public Internet) and created as a scientific research 
infrastructure by the NLR consortium [3]. Each ring
includes multiple segments of optical fiber, linked by
routers and repeaters. More importantly, for the medium 
and long ring, each network packet traverses a unique 
path without going along the same segment. See NLR [3] 
for a map. 

Though all rings in the testbed are capable of 10 Gbps 
end-to-end, we are only able to operate at hundreds of 
megabits per second at this time due to network con- 
struction. Nonetheless, we are able to study the effects 
of disaster on dedicated wide-area lambda networks and 
hope to be able to use increasingly more bandwidth in 
the future. 

To study the effects of disaster in this wide-area
testbed, we conducted the same disaster experiment described
in Section 5.3. We induced loss on the wide-area
link 0.5 seconds before the primary site failed, via a router
that we control. Later, when the primary site failed, the
wide-area link and all processes were killed. Figure 12
shows data loss during this disaster for the medium path 
on the Cornell NLR-Rings testbed. The x-axis shows 
the loss induced on the wide-area link (link losses are 
random, independent and identically distributed) and the 
y-axis shows the number of messages sent and the num- 
ber of unrecoverable messages. There are two interest- 
ing results illustrated. First, local-sync lost messages 
even when no loss was induced on the wide-area link. 
This may be because our wide-area testbed itself drops some
packets, which prevents local-sync protocols from delivering
to the mirroring application. Local-sync+FEC
and network-sync, on the other hand, did not lose mes- 
sages because both can mask wide-area link loss. Sec- 
ond, due to the relatively low bandwidth, packets were 
able to transit outside of the local-area, preventing loss 
from occurring in the local-area and enabling both local- 
sync+FEC and network-sync to mask wide-area link 
loss. 


6 Related Work 


6.1 Mirroring modes 


Synchronous mirroring, like IBM’s Peer-to-Peer Remote 
Copy (PPRC) [6] and EMC’s Symmetrix Remote Data 
Facility (SRDF) [12], is a technique often used in disaster
tolerance solutions. It guarantees that local copies of
data are consistent with copies at a remote site, and also 
guarantees that the mirror sites are as up-to-date as possi- 
ble. Naturally, the drawback is that of added I/O latency 
to every write operation; furthermore, long distance links 
make this technique prohibitively expensive. 

An alternate solution is to use asynchronous remote 
mirroring [19, 24, 31]. For example, SnapMirror [31] 
provides asynchronous mirroring of file systems by peri- 
odically transferring self-consistent data snapshots from 
a source volume to a destination volume. Users are provided
with a knob for setting the frequency of updates:
if set to a high value, the mirror would be nearly current 
with the source, while setting to a low value reduces the 
network bandwidth consumption at the risk of increased 
data loss. Seneca [19] is a storage area network mirror- 
ing solution and similarly attempts to reduce the amount 
of traffic sent over the wide-area network. 

SnapMirror works at the block level, using the 
WAFL [17] file system active block map to identify 
changed blocks and avoid sending deleted blocks. More- 
over, since it operates at this level, it is able to optimize 
data reads and writes. The authors showed that for up- 
date intervals as short as one minute, data transfers were 
reduced by 30% to 80%. 

Similar to SnapMirror, Seneca [19] is another asyn- 
chronous mirroring solution that attempts to reduce the 
traffic sent over the wide-area network, but also increases 
the risk of data loss. Seneca operates at the level of a stor- 
age area network (SAN) instead of the file system level. 

Semi-synchronous mirroring is yet another mode of 
operation, closely related to both synchronous and asyn- 
chronous mirroring. In such a mode, writes are sent to 
both the local and the remote storage sites at the same 
time, the I/O operation returning when the local write 
is completed. However, subsequent write I/O is delayed
until the completion of the preceding remote write com- 
mand. In [42] the authors show that by leveraging a log 
policy for the active remote write commands the system 
is able to allow a limited number of write I/O operations 
to proceed before waiting for acknowledgment from the 
remote site, thereby reducing the latency significantly. 


6.2 Error correcting codes 


Packet level forward error correcting (FEC) schemes typ- 
ically transmit c repair packets for every r data packets, 
using coding schemes with which all data packets can 
be reconstructed if at least r out of r+ c data and repair 
packets are received [18]. In contrast, convolution codes 
work on bit or symbol streams of arbitrary length, and 
are most often decoded with the Viterbi algorithm [38]. 
Our work favors FEC: FEC schemes have the benefit of 
being highly tunable — trading off overhead and timeli- 
ness, and are very stable under stress — provided that the 
recovery does not result in high levels of traffic. 

FEC techniques are increasingly popular. Recent ap- 
plications include FEC for multicasting data to large 
groups [34], where FEC can be employed either by re- 
ceivers [9] or senders [18, 28]. In general, fast, efficient 
encodings like Tornado codes [11] make sender-based 
FEC schemes very attractive in scenarios where dedi- 
cated senders distribute bulk data to a large number of 
receivers. 

Likewise, FEC can be used when connections experience
long transmission delays, in which case the use of
redundancy helps bound the delivery delays within some
acceptable limits, even in the presence of errors [18, 33]. 
For example, deep space satellite communications [43] 
have been using error correcting codes for decades both 
for achieving maximal information transfer over a re- 
stricted bandwidth communication link and in the pres- 
ence of data corruption. 

SMFS is not the first system to propose exposing net- 
work state to higher level storage systems [32]. The 
difference, however, is that network-sync can be imple- 
mented with gateway routers under the control of site op- 
erators and does not require change to wide-area Internet 
routers. 


6.3 Reliable Storage & Recovery 


Recent studies have shown that failures plague stor- 
age and other components of large computing datacen- 
ters [36]. As a result, many systems replicate data to 
reduce risk of data loss [5, 14, 16, 25, 23, 37]. However, 
replication alone is not complete without recovery. 

Recovery in the face of disaster has been a problem 
that has received a lot of attention [13, 21, 22]. In [20], 
for example, the authors propose a reactive way to solve 
the data recovery scheduling problem once the disas- 
ter has occurred. Potential recovery processes are first 
mapped onto recovery graphs — the recovery graphs 
capture alternative approaches for recovering workloads, 
precedence relationships, timing constraints, etc. The re- 
covery scheduling problem is encoded as an optimization 
problem with the end goal of finding the schedule that 
minimizes some measure of penalty; several methods for 
finding optimal and near-optimal solutions are given. 

Aguilera et al. [4] explore the tradeoff between the
ability to recover and the cost of recovery in enterprise 
storage systems. They propose a multi-tier file system 
called TierFS that employs a “recoverability log” used 
to increase the recoverability of lower tiers by using the 
highest tier. 

Both LOCKSS [26] and Deep Store [44] address the 
problem of reliably preserving large volumes of data for 
virtually indefinite periods of time, dealing with threats 
like format obsolescence and “bit-rot.” LOCKSS consists
of a set of low-cost, independent, persistent cooperating
caches that use a voting scheme to detect and repair
damaged content. Deep Store eliminates redundancy
both within and across files; it distributes data for scal- 
ability and provides variable levels of replication based 
on the importance or the degree of dependency of each 
chunk of stored data. 

Baker et al. [8] consider the problem of recovery from
failure of long-term storage of digital information. They
propose a “reliability model” encompassing latent and
correlated faults, and the detection time of such latent
faults. They show that a simple combination of auditing
(to detect latent faults) as soon as possible, automatic
recovery, and independence of replicas yields the most
benefit with respect to the cost of each technique.


7 Conclusion 


The conundrum facing many disaster tolerance and re- 
covery designs is the tradeoff between loss of perfor- 
mance and the potential loss of data. On the one hand, 
it may not be desirable to slow application response time 
until it is assured that data will not be lost in the event of 
disaster. On the other hand, the prospect of data loss can 
be catastrophic for many companies and organizations. 
Unfortunately, there is not much of a middle ground in 
the design space and designers must choose one or the 
other. 

The network-sync remote mirroring option poten- 
tially offers an improvement, providing performance of 
enterprise-level semi-synchronous remote mirroring so- 
lutions while increasing their data reliability guarantees. 
Like native semi-synchronous protocols, network-sync 
protocols simultaneously send each update to the remote 
mirror as the primary handles the update locally. Rather 
than waiting for an acknowledgment from the remote 
mirror, it delays only until it receives feedback from 
an underlying communication layer, acknowledging that 
data and repair packets have been placed on the exter- 
nal wide-area network. This minimizes the loss of data 
in the event of disaster. Applications requiring strong 
remote-sync guarantees can still wait for a remote ac- 
knowledgment, but for most purposes, network-sync rep- 
resents an appealing new option. Our experiments show 
that SMFS, a remote mirroring solution that uses the 
network-sync option, exhibits performance that is independent
of link latency, in marked contrast to most existing
technologies.


Notes


1 Egress and ingress routers operate as gateway routers between
datacenter and wide-area networks, where egress routers send packets
from local datacenter networks to the wide-area network and ingress
routers receive packets from the wide-area network and forward
packets to local datacenter networks. Generally, egress routers also function
as ingress routers and vice versa since they handle duplex traffic. 


2 A distributed log-structured file system can expose an NFS
interface to hosts; however, it stores data in a distributed log-structured file
system instead of a local UNIX file system (UFS). 
Abstract 


In this paper we describe Cumulus, a system for effi- 
ciently implementing filesystem backups over the Inter- 
net. Cumulus is specifically designed under a thin cloud 
assumption—that the remote datacenter storing the back- 
ups does not provide any special backup services, but 
only provides a least-common-denominator storage in- 
terface (i.e., get and put of complete files). Cumulus 
aggregates data from small files for remote storage, and 
uses LFS-inspired segment cleaning to maintain storage 
efficiency. Cumulus also efficiently represents incremen- 
tal changes, including edits to large files. While Cumulus 
can use virtually any storage service, we show that its ef- 
ficiency is comparable to integrated approaches. 


1 Introduction 


It has become increasingly popular to talk of “cloud
computing” as the next infrastructure for hosting data and
deploying software and services. Not surprisingly, there
is a wide range of different architectures that fall
under the umbrella of this vague-sounding term, ranging 
from highly integrated and focused (e.g., Software As 
A Service offerings such as Salesforce.com) to decom- 
posed and abstract (e.g., utility computing such as Ama- 
zon’s EC2/S3). Towards the former end of the spectrum, 
complex logic is bundled together with abstract resources 
at a datacenter to provide a highly specific service— 
potentially offering greater performance and efficiency 
through integration, but also reducing flexibility and in- 
creasing the cost to switch providers. At the other end of 
the spectrum, datacenter-based infrastructure providers 
offer minimal interfaces to very abstract resources (e.g., 
“store file”), making portability and provider switching 
easy, but potentially incurring additional overheads from 
the lack of server-side application integration. 

In this paper, we explore this thin-cloud vs. thick- 
cloud trade-off in the context of a very simple applica- 
tion: filesystem backup. Backup is a particularly attrac- 
tive application for outsourcing to the cloud because it is 
relatively simple, the growth of disk capacity relative to 
tape capacity has created an efficiency and cost inflection 
point, and the cloud offers easy off-site storage, always 
a key concern for backup. For end users there are few 
backup solutions that are both trivial and reliable
(especially against disasters such as fire or flood), and
ubiquitous broadband now provides sufficient bandwidth
resources to offload the application. For small to mid-sized 
businesses, backup is rarely part of critical business pro- 
cesses and yet is sufficiently complex to “get right” that it 
can consume significant IT resources. Finally, larger en- 
terprises benefit from backing up to the cloud to provide 
a business continuity hedge against site disasters. 


However, pricing cloud-based backup services attractively
requires minimizing the capital costs of data center
storage and the operational bandwidth costs of shipping
the data there and back. To this end, most existing
cloud-based backup services (e.g., Mozy, Carbonite, 
Symantec’s Protection Network) implement integrated 
solutions that include backup-specific software hosted 
on both the client and at the data center (usually using 
servers owned by the provider). In principle, this ap- 
proach allows greater storage and bandwidth efficiency 
(server-side compression, cleaning, etc.) but also reduces 
portability—locking customers into a particular provider. 


In this paper we explore the other end of the de- 
sign space—the thin cloud. We describe a cloud-based 
backup system, called Cumulus, designed around a min- 
imal interface (put, get, delete, list) that is triv- 
ially portable to virtually any on-line storage service. 
Thus, we assume that any application logic is imple- 
mented solely by the client. In designing and evaluat- 
ing this system we make several contributions. First, we 
show through simulation that, through careful design, it 
is possible to build efficient network backup on top of 
a generic storage service—competitive with integrated 
backup solutions, in spite of having no specific backup 
support in the underlying storage service. Second, we 
build a working prototype of this system using Amazon’s 
Simple Storage Service (S3) and demonstrate its effec- 
tiveness on real end-user traces. Finally, we describe how 
such systems can be tuned for cost instead of for band- 
width or storage, both using the Amazon pricing model 
as well as for a range of storage to network cost ratios. 


In the remainder of this paper, we first describe prior 
work in backup and network-based backup, followed by 
a design overview of Cumulus and an in-depth descrip- 
tion of its implementation. We then provide both simula- 
tion and experimental results of Cumulus performance, 
overhead, and cost in trace-driven scenarios. We con- 
clude with a discussion of the implications of our work 
and how this research agenda might be further explored. 


2 Related Work 


Many traditional backup tools are designed to work well 
for tape backups. The dump, cpio, and tar [16] utilities 
are common on Unix systems and will write a full filesys- 
tem backup as a single stream of data to tape. These utili- 
ties may create a full backup of a filesystem, but also sup- 
port incremental backups, which only contain files which 
have changed since a previous backup (either full or an- 
other incremental). Incremental backups are smaller and 
faster to create, but mostly useless without the backups 
on which they are based. 

Organizations may establish backup policies specify- 
ing at what granularity backups are made, and how long 
they are kept. These policies might then be implemented 
in various ways. For tape backups, long-term backups 
may be full backups so they stand alone; short-term daily 
backups may be incrementals for space efficiency. Tools 
such as AMANDA [2] build on dump or tar, automating 
the process of scheduling full and incremental backups 
as well as collecting backups from a network of comput- 
ers to write to tape as a group. Cumulus supports flexible 
policies for backup retention: an administrator does not 
have to select at the start how long to keep backups, but 
rather can delete any snapshot at any point. 

The falling cost of disk relative to tape makes backup 
to disk more attractive, especially since the random ac- 
cess permitted by disks enables new backup approaches. 
Many recent backup tools, including Cumulus, take ad- 
vantage of this trend. Two approaches for comparing 
these systems are by the storage representation on disk, 
and by the interface between the client and the storage— 
while the disk could be directly attached to the client, of- 
ten (especially with a desire to store backups remotely) 
communication will be over a network. 

Rsync [22] efficiently mirrors a filesystem across a 
network using a specialized network protocol to identify 
and transfer only those parts of files that have changed. 
Both the client and storage server must have rsync in- 
stalled. Users typically want backups at multiple points 
in time, so rsnapshot [19] and other wrappers around 
rsync exist that will store multiple snapshots, each as a 
separate directory on the backup disk. Unmodified files 
are hard-linked between the different snapshots, so stor- 
age is space-efficient and snapshots are easy to delete. 

The rdiff-backup [7] tool is similar to rsnapshot, but 
it changes the storage representation. The most re- 
cent snapshot is a mirror of the files, but the rsync al- 
gorithm creates compact deltas for reconstructing older 
versions—these reverse incrementals are more space ef- 
ficient than full copies of files as in rsnapshot. 

Another modification to the storage format at the 
server is to store snapshots in a content-addressable
storage system. Venti [17] uses hashes of block contents 
to address data blocks, rather than a block number on 
disk. Identical data between snapshots (or even within a 
snapshot) is automatically coalesced into a single copy 
on disk—giving the space benefits of incremental back- 
ups automatically. Data Domain [26] offers a similar 
but more recent and efficient product; in addition to per- 
formance improvements, it uses content-defined chunk 
boundaries so de-duplication can be performed even if 
data is offset by less than the block size. 

A limitation of these tools is that backup data must be 
stored unencrypted at the server, so the server must be 
trusted. Box Backup [21] modifies the protocol and stor- 
age representation to allow the client to encrypt data be- 
fore sending, while still supporting rsync-style efficient 
network transfers. 

Most of the previous tools use a specialized protocol to 
communicate between the client and the storage server. 
An alternate approach is to target a more generic inter- 
face, such as a network file system or an FTP-like pro- 
tocol. Amazon S3 [3] offers an HTTP-like interface to 
storage. The operations supported are similar enough be- 
tween these different protocols—get/put/delete on files 
and list on directories—that a client can easily support 
multiple protocols. Cumulus tries to be network-friendly 
like rsync-based tools, while using only a generic storage 
interface. 

Jungle Disk [13] can perform backups to Amazon S3. 
However, the design is quite different from that of Cu- 
mulus. Jungle Disk is first a network filesystem with 
Amazon S3 as the backing store. Jungle Disk can also 
be used for backups, keeping copies of old versions of 
files instead of deleting them. But since it is optimized 
for random access it is less efficient than Cumulus for 
pure backup—features like aggregation in Cumulus can 
improve compression, but are at odds with efficient ran- 
dom access. 

Duplicity [8] aggregates files together before stor- 
age for better compression and to reduce per-file stor- 
age costs at the server. Incremental backups use space- 
efficient rsync-style deltas to represent changes. How- 
ever, because each incremental backup depends on the 
previous, space cannot be reclaimed from old snapshots 
without another full backup, with its associated large up- 
load cost. Cumulus was inspired by duplicity, but avoids 
this problem of long dependency chains of snapshots. 

Brackup [9] has a design very similar to that of Cu- 
mulus. Both systems separate file data from metadata: 
each snapshot contains a separate copy of file metadata 
as of that snapshot, but file data is shared where possi- 
ble. The split data/metadata design allows old snapshots 
to be easily deleted. Cumulus differs from Brackup pri- 
marily in that it places a greater emphasis on aggregating 
small files together for storage purposes, and adds a seg- 
Tool: Multiple snapshots, Simple server, Incremental forever, Sub-file delta storage, Encryption 
rsync ¥ NA 
rsnapshot v v 
rdiff-backup | Vv v v 
Box Backup | Vv v veoev 
Jungle Disk Vv Vv v 
duplicity veev veoev 
Brackup Vv Vv v 
Cumulus Vv Vv voeoev 


Multiple snapshots: Can store multiple versions of files at dif- 
ferent points in time; Simple server: Can back up almost any- 
where; does not require special software at the server; Incre- 
mental forever: Only initial backup must be a full backup; 
Sub-file delta storage: Efficiently represents small differences 
between files on storage; only relevant if storing multiple snap- 
shots; Encryption: Data may be encrypted for privacy before 
sending to storage server. 


Table 1: Comparison of features among selected tools 
that back up to networked storage. 


ment cleaning mechanism to manage the inefficiency in- 
troduced. Additionally, Cumulus tries to efficiently rep- 
resent small changes to all types of large files and can 
share metadata where unchanged; both changes reduce 
the cost of incremental backups. 


Peer-to-peer systems may be used for storing backups.
Pastiche [5] is one such system, and focuses on the
problem of identifying and sharing data between different
users. Pastiche uses content-based addressing for
de-duplication. But if sharing is not needed, Brackup and
Cumulus could use peer-to-peer systems as well, simply
treating them as another storage interface offering get and
put operations. 


While other interfaces to storage may be available— 
Antiquity [24] for example provides a log append 
operation—a get/put interface likely still works best 
since it is simpler and a single put is cheaper than multi- 
ple appends to write the same data. 


Table 1 summarizes differences between some of the 
tools discussed above for backup to networked storage. 
In relation to existing systems, Cumulus is most similar 
to duplicity (without the need to occasionally re-upload 
a new full backup), and Brackup (with an improved 
scheme for incremental backups including rsync-style 
deltas, and improved reclamation of storage space). 


3 Design 


In this section we present the design of our approach for 
making backups to a thin cloud remote storage service. 


3.1 Storage Server Interface 


We assume only a very narrow interface between a client 
generating a backup and a server responsible for storing 
the backup. The interface consists of four operations: 


Get: Given a pathname, retrieve the contents of a file 
from the server. 


Put: Store a complete file on the server with the given 
pathname. 


List: Get the names of files stored on the server. 


Delete: Remove the given file from the server, reclaim- 
ing its space. 


Note that all of these operations operate on entire files; 
we do not depend upon the ability to read or write arbi- 
trary byte ranges within a file. Cumulus neither requires 
nor uses support for reading and setting file attributes 
such as permissions and timestamps. The interface is 
simple enough that it can be implemented on top of any 
number of protocols: FTP, SFTP, WebDAV, S3, or nearly 
any network file system. 

Since the only way to modify a file in this narrow in- 
terface is to upload it again in full, we adopt a write- 
once storage model, in which a file is never modified 
after it is first stored, except to delete it to recover 
space. The write-once model provides convenient fail- 
ure guarantees: since files are never modified in place, 
a failed backup run cannot corrupt old snapshots. At 
worst, it will leave a partially-written snapshot which can be
garbage-collected. Because Cumulus does not modify 
files in place, we can keep snapshots at multiple points 
in time simply by not deleting the files that make up old 
snapshots. 
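
The narrow interface and write-once model described above can be sketched in a few lines. This is a minimal illustration, not the Cumulus implementation: the class name LocalDirectoryStore and the choice of a local directory as the backing store are assumptions made only so the example is runnable.

```python
import os


class LocalDirectoryStore:
    """Hypothetical write-once store exposing only get/put/list/delete."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, name):
        # Flat namespace: reject names that would escape the root directory.
        if os.path.basename(name) != name or name in ("", ".", ".."):
            raise ValueError("invalid name: %r" % name)
        return os.path.join(self.root, name)

    def put(self, name, data):
        # Mode "xb" raises FileExistsError if the file already exists,
        # which enforces the write-once rule: files are never modified.
        with open(self._path(name), "xb") as f:
            f.write(data)

    def get(self, name):
        # Whole-file read only; no byte-range access is assumed.
        with open(self._path(name), "rb") as f:
            return f.read()

    def list(self):
        return sorted(os.listdir(self.root))

    def delete(self, name):
        # The only permitted change to a stored file: removing it entirely.
        os.remove(self._path(name))
```

Any backend offering these four whole-file operations (for example an FTP or S3 wrapper) could stand in for the local directory used here.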


3.2 Storage Segments 


When storing a snapshot, Cumulus will often group data 
from many smaller files together into larger units called 
segments. Segments become the unit of storage on the 
server, with each segment stored as a single file. Filesys- 
tems typically contain many small files (both our traces 
described later and others, such as [1], support this ob- 
servation). Aggregation of data produces larger files for 
storage at the server, which can be beneficial to: 

Avoid inefficiencies associated with many small files: 
Storage servers may dislike storing many small files for 
various reasons—higher metadata costs, wasted space 
from rounding up to block boundaries, and more seeks 
when reading. This preference may be expressed in the 
[Figure 1 diagram: two snapshot descriptors (Date: 2008-01-01 12:00:00, Root: A/0, Segments: A B; Date: 2008-01-02 12:00:00, Root: C/0, Segments: B C) point into a segment store holding Segments A, B, and C.] 


Figure 1: Simplified schematic of the basic format for 
storing snapshots on a storage server. Two snapshots are 
shown, taken on successive days. Each snapshot contains 
two files. file1 changes between the two snapshots, 
but the data for file2 is shared between the snapshots. 
For simplicity in this figure, segments are given letters as 
names instead of the 128-bit UUIDs used in practice. 


cost model of the provider. Amazon S3, for example, has 
both a per-request and a per-byte cost when storing a file 
that encourages using files greater than 100 KB in size. 

Avoid costs in network protocols: Small files result in 
relatively larger protocol overhead, and may be slower 
over higher-latency connections. Pipelining (if sup- 
ported) or parallel connections may help, but larger seg- 
ments make these less necessary. We study one instance 
of this effect in more detail in Section 5.4.5. 

Take advantage of inter-file redundancy with segment 
compression: Compression can be more effective when 
small files are grouped together. We examine this effect 
in Section 5.4.2. 

Provide additional privacy when encryption is used: 
Aggregation helps hide the size as well as contents of 
individual files. 

Finally, as discussed in Sections 3.4 and 4.3, changes 
to small parts of larger files can be efficiently repre- 
sented by effectively breaking those files into smaller 
pieces during backup. For the reasons listed above, 
re-aggregating this data becomes even more important 
when sub-file incremental backups are supported. 


3.3 Snapshot Format 


Figure 1 illustrates the basic format for backup snapshots.
Cumulus snapshots logically consist of two parts: 
a metadata log which lists all the files backed up, and the 
file data itself. Both metadata and data are broken apart 
into blocks, or objects, and these objects are then packed 
together into segments, compressed as a unit and option- 
ally encrypted, and stored on the server. Each segment 
has a unique name—we use a randomly generated 128- 
bit UUID so that segment names can be assigned without 
central coordination. Objects are numbered sequentially 
within a segment. 


Segments are internally structured as a TAR file, with 
each file in the archive corresponding to an object in the 
segment. Compression and encryption are provided by 
filtering the raw segment data through gzip, bzip2, 
gpg, or other similar external tools. 
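
As a rough illustration of the segment layout just described, the following sketch packs a list of objects into an in-memory TAR archive named by a random UUID. The function names and the member naming scheme ("segment name/object number in hex") are assumptions for this example; the text above does not specify how members are named.

```python
import io
import tarfile
import uuid


def build_segment(objects):
    """Pack a list of byte strings into one TAR archive held in memory.

    Returns (segment_name, tar_bytes, refs), where refs[i] is the
    reference string for objects[i].
    """
    segment_name = str(uuid.uuid4())  # randomly generated 128-bit UUID
    buf = io.BytesIO()
    refs = []
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for index, data in enumerate(objects):
            # Assumed naming: objects are numbered sequentially per segment.
            member = "%s/%08x" % (segment_name, index)
            info = tarfile.TarInfo(name=member)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            refs.append(member)
    return segment_name, buf.getvalue(), refs


def read_object(tar_bytes, ref):
    """Extract one object from a segment given its reference string."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        return tar.extractfile(ref).read()
```

The returned bytes are an uncompressed archive; compression or encryption would be applied to them as a separate step, in line with the filtering approach described above.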


A snapshot can be decoded by traversing the tree (or, 
in the case of sharing, DAG) of objects. The root object 
in the tree is the start of the metadata log. The metadata 
log need not be stored as a flat object; it may contain 
pointers to objects containing other pieces of the meta- 
data log. For example, if many files have not changed, 
then a single pointer to a portion of the metadata for an 
old snapshot may be written. The metadata objects even- 
tually contain entries for individual files, with pointers to 
the file data as the leaves of the tree. 


The metadata log entry for each individual file speci- 
fies properties such as modification time, ownership, and 
file permissions, and can be extended to include addi- 
tional information if needed. It includes a cryptographic 
hash so that file integrity can be verified after a restore. 
Finally, it includes a list of pointers to objects containing 
the file data. Metadata is stored in a text, not binary, for- 
mat to make it more transparent. Compression applied to 
the segments containing the metadata, however, makes 
the format space-efficient. 


The one piece of data in each snapshot not stored in 
a segment is a snapshot descriptor, which includes a 
timestamp and a pointer to the root object. 


Starting with the root object stored in the snapshot de- 
scriptor and traversing all pointers found, a list of all 
segments required by the snapshot can be constructed. 
Since segments may be shared between multiple snap- 
shots, a garbage collection process deletes unreferenced 
segments when snapshots are removed. To simplify 
garbage-collection, each snapshot descriptor includes 
(though it is redundant) a summary of segments on which 
it depends. 
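
Because each descriptor carries its own segment summary, garbage collection reduces to a set difference. The sketch below assumes descriptors have already been parsed into dictionaries; that in-memory form and the function name are hypothetical. The example data mirrors the two snapshots of Figure 1.

```python
def unreferenced_segments(stored_segments, snapshot_descriptors):
    """Return the stored segments that no remaining snapshot depends on.

    snapshot_descriptors: iterable of dicts with a "segments" entry
    (an assumed parsed form of the per-snapshot segment summary).
    """
    live = set()
    for descriptor in snapshot_descriptors:
        live.update(descriptor["segments"])
    return set(stored_segments) - live


# Example data taken from Figure 1.
descriptors = [
    {"date": "2008-01-01 12:00:00", "root": "A/0", "segments": ["A", "B"]},
    {"date": "2008-01-02 12:00:00", "root": "C/0", "segments": ["B", "C"]},
]
```

With both snapshots present nothing can be deleted; once the first snapshot is removed, only segment A becomes unreferenced, since segment B is still shared with the second snapshot.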


Pointers within the metadata log include cryptographic 
hashes so that the integrity of all data can be validated 
starting from the snapshot descriptor, which can be dig- 
itally signed. Additionally, Cumulus writes a summary 
file with checksums for all segments so that it can quickly 
check snapshots for errors without a full restore. 
3.4 Sub-File Incrementals 


If only a small portion of a large file changes between 
snapshots, only the changed portion of the file should 
be stored. The design of the Cumulus format supports 
this. The contents of each file are specified as a list of 
objects, so new snapshots can continue to point to old 
objects when data is unchanged. Additionally, pointers 
to objects can include byte ranges to allow portions of 
old objects to be reused even if some data has changed. 
We discuss how our implementation identifies data that 
is unchanged in Section 4.3. 
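
To make the role of byte ranges concrete, here is a small sketch of how file contents could be reassembled from a list of object pointers. The tuple representation (object reference, start, length) and the dictionary standing in for segment reads are assumptions for illustration, not the on-disk Cumulus format.

```python
def reassemble(pointers, objects):
    """Rebuild file contents from a list of object pointers.

    pointers: list of (object_ref, start, length); start and length are
        both None when the whole object is used.
    objects: dict mapping object_ref to bytes (stands in for segment reads).
    """
    parts = []
    for ref, start, length in pointers:
        data = objects[ref]
        if start is None:
            parts.append(data)
        else:
            # Reuse only a portion of an old object.
            parts.append(data[start:start + length])
    return b"".join(parts)
```

For instance, if an old object holds b"hello world" and only the last word changes, a new snapshot can point at the first six bytes of the old object and store just the replacement bytes in a new object.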


3.5 Segment Cleaning 


When old snapshots are no longer needed, space is re- 
claimed by deleting the root snapshot descriptors for 
those snapshots, then garbage collecting unreachable 
segments. It may be, however, that some segments only 
contain a small fraction of useful data—the remainder 
of these segments, data from deleted snapshots, is now 
wasted space. This problem is similar to the problem 
of reclaiming space in the Log-Structured File System 
(LFS) [18]. 

There are two approaches that can be taken to seg- 
ment cleaning given that multiple backup snapshots are 
involved. The first, in-place cleaning, is most like the 
cleaning in LFS. It identifies segments with wasted space 
and rewrites the segments to keep just the needed data. 

This mode of operation has several disadvantages, 
however. It violates the write-once storage model, in 
that the data on which a snapshot depends is changed 
after the snapshot is written. It requires detailed book- 
keeping to determine precisely which data must be re- 
tained. Finally, it requires downloading and decrypting 
old segments—normal backups only require an encryp- 
tion key, but cleaning needs the decryption key as well. 

The alternative to in-place cleaning is to never mod- 
ify segments in old snapshots. Instead, Cumulus avoids 
referring to data in inefficient old segments when creat- 
ing a new snapshot, and writes new copies of that data 
if needed. This approach avoids the disadvantages listed 
earlier, but is less space-efficient. Dead space is not re- 
claimed until snapshots depending on the old segments 
are deleted. Additionally, until then data is stored re- 
dundantly since old and new snapshots refer to different 
copies of the same data. 

We analyzed both approaches to cleaning in simula- 
tion. We found that the cost benefits of in-place cleaning 
were not large enough to outweigh its disadvantages, and 
so our Cumulus prototype does not clean in place. 

The simplest policy for selecting segments to clean is
to set a minimum segment utilization threshold, α, that
triggers cleaning of a segment. We define utilization as
the fraction of bytes within the segment which are
referenced by a current snapshot. For example, α = 0.8
will ensure that at least 80% of the bytes in segments
are useful. Setting α = 0 disables segment cleaning
altogether. Cleaning thresholds closer to 1 will decrease
storage overhead for a single snapshot, but this more
aggressive cleaning requires transferring more data. 
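
The threshold policy can be sketched as a simple filter over per-segment byte counts. The function name and the input format are hypothetical; the sketch only shows the selection rule, not how the selected data is later rewritten.

```python
def segments_to_clean(segment_stats, threshold):
    """Select segments whose utilization falls below the threshold.

    segment_stats: dict mapping segment name to
        (referenced_bytes, total_bytes).
    threshold: minimum utilization in [0, 1]; 0 disables cleaning,
        because no utilization can be below 0.
    """
    selected = []
    for name, (referenced, total) in sorted(segment_stats.items()):
        utilization = referenced / total if total else 0.0
        if utilization < threshold:
            selected.append(name)
    return selected
```

With a threshold of 0.8, a segment in which 30% of the bytes are still referenced would be selected, while one at 90% would be left alone.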

More complex policies are possible as well, such as 
a cost-benefit evaluation that favors repacking long-lived 
segments. Cleaning may be informed by snapshot re- 
tention policies: cleaning is more beneficial immediately 
before creating a long-term snapshot, and cleaning can 
also consider which other snapshots currently reference a 
segment. Finally, segment cleaning may reorganize data, 
such as by age, when segments are repacked. 

Though not currently implemented, Cumulus could 
use heuristics to group data by expected lifetime when 
a backup is first written in an attempt to optimize seg- 
ment data for later cleaning (as in systems such as 
WOLF [23]). 


3.6 Restoring from Backup 


Restoring data from previous backups may take several 
forms. A complete restore extracts all files as they were 
on a given date. A partial restore recovers one or a small 
number of files, as in recovering from an accidental dele- 
tion. As an enhancement to a partial restore, all available 
versions of a file or set of files can be listed. 

Cumulus is primarily optimized for the first form of 
restore—recovering all files, such as in the event of the 
total loss of the original data. In this case, the restore pro- 
cess will look up the root snapshot descriptor at the date 
to restore, then download all segments referenced by that 
snapshot. Since segment cleaning seeks to avoid leaving 
much wasted space in the segments, the total amount of 
data downloaded should be only slightly larger than the 
size of the data to restore. 

For partial restores, Cumulus downloads those seg- 
ments that contain metadata for the snapshot to locate 
the files requested, then locates each of the segments 
containing file data. This approach might require fetch- 
ing many segments—for example, if restoring a directory 
whose files were added incrementally over many days— 
but will usually be quick. 

Cumulus is not optimized for tracking the history of 
individual files. The only way to determine the list of 
changes to a file or set of files is to download and process 
the metadata logs for all snapshots. However, a client 
could keep a database of this information to allow more 
efficient queries. 


3.7 Limitations 


Cumulus is not designed to replace all existing backup 
systems. As a result, there are situations in which other 
systems will do a better job. 
The approach embodied by Cumulus is for the client 
making a backup to do most of the work, and leave the 
backup itself almost entirely opaque to the server. This 
approach makes Cumulus portable to nearly any type of 
storage server. However, a specialized backup server 
could provide features such as automatically repacking 
backup data when deleting old snapshots, eliminating the 
overhead of client-side segment cleaning. 

Cumulus, as designed, does not offer coordination be- 
tween multiple backup clients, and so does not offer fea- 
tures such as de-duplication between backups from dif- 
ferent clients. While Cumulus could use convergent en- 
cryption [6] to allow de-duplication even when data is 
first encrypted at the client, several issues prevent us 
from doing so. Convergent encryption would not work 
well with the aggregation in Cumulus. Additionally, 
server-side de-duplication is vulnerable to dictionary at- 
tacks to determine what data clients are storing, and stor- 
age accounting for billing purposes is more difficult. 

Finally, the design of Cumulus is predicated on the 
fact that backing up each file on the client to a sepa- 
rate file on the server may introduce too much overhead, 
and so Cumulus groups data together into segments. If 
it is known that the storage server and network protocol 
can efficiently deal with small files, however, then group- 
ing data into segments adds unnecessary complexity and 
overhead. Other disk-to-disk backup programs may be a 
better match in this case. 


4 Implementation 


We discuss details of the implementation of the Cumu- 
lus prototype in this section. Our implementation is rel- 
atively compact: only slightly over 3200 lines of C++ 
source code (as measured by SLOCCount [25]) imple- 
menting the core backup functionality, along with an- 
other roughly 1000 lines of Python for tasks such as re- 
stores, segment cleaning, and statistics gathering. 


4.1 Local Client State 


Each client stores on its local disk information about 
recent backups, primarily so that it can detect which 
files have changed and properly reuse data from previous 
snapshots. This information could be kept on the stor- 
age server. However, storing it locally reduces network 
bandwidth and improves access times. We do not need 
this information to recover data from a backup so its loss 
is not catastrophic, but this local state does enable vari- 
ous performance optimizations during backups. 

The client’s local state is divided into two parts: a local 
copy of the metadata log and an SQLite database [20] 
containing all other needed information. 

Cumulus uses the local copy of the previous metadata 
log to quickly detect and skip over unchanged files based 
on modification time. Cumulus also uses it to
delta-encode the metadata log for new snapshots. 

An SQLite database keeps a record of recent snapshots 
and all segments and objects stored in them. The table of 
objects includes an index by content hash to support data 
de-duplication. Enabling de-duplication leaves Cumulus 
vulnerable to corruption from a hash collision [11, 12], 
but, as with other systems, we judge the risk to be small. 
The hash algorithm (currently SHA-1) can be upgraded 
as weaknesses are found. In the event that client data 
must be recovered from backup, the content indices can 
be rebuilt from segment data as it is downloaded during 
the restore. 

Note that the Cumulus backup format does not specify 
the format of this information stored locally. It is entirely 
possible to create a new and very different implementa- 
tion which nonetheless produces backups conforming to 
the structure described in Section 3.3 and readable by our 
Cumulus prototype. 


4.2 Segment Cleaning 


The Cumulus backup program, written in C++, does 
not directly implement segment cleaning heuristics. In- 
stead, a separate Cumulus utility program, implemented 
in Python, controls cleaning. 

When writing a snapshot, Cumulus records in the local 
database a summary of all segments used by that snap- 
shot and the fraction of the data in each segment that is 
actually referenced. The Cumulus utility program uses 
these summaries to identify segments which are poorly- 
utilized and marks the selected segments as “expired” in 
the local database. It also considers which snapshots re- 
fer to the segments, and how long those snapshots are 
likely to be kept, during cleaning. On subsequent back- 
ups, the Cumulus backup program re-uploads any data 
that is needed from expired segments. Since the database 
contains information about the age of all data blocks, 
segment data can be grouped by age when it is cleaned. 

If local client state is lost, this age information will be 
lost. When the local client state is rebuilt all data will 
appear to have the same age, so cleaning may not be op- 
timal, but can still be done. 


4.3 Sub-File Incrementals 


As discussed in Section 3.4, the Cumulus backup format 
supports efficiently encoding differences between file 
versions. Our implementation detects changes by divid- 
ing files into small chunks in a content-sensitive manner 
(using Rabin fingerprints) and identifying chunks that are 
common, as in the Low-Bandwidth File System [15]. 
When a file is first backed up, Cumulus divides it into 
blocks of about a megabyte in size which are stored indi- 
vidually in objects. In contrast, the chunks used for sub- 
file incrementals are quite a bit smaller: the target size is 
4 KB (though variable, with a 2 KB minimum and 64 KB 
maximum). Before storing each megabyte block, Cumu- 
lus computes a set of chunk signatures: it divides the 
data block into non-overlapping chunks and computes a 
(20-byte SHA-1 signature, 2-byte length) tuple for each 
chunk. The list of chunk signatures for each object is 
stored in the local database. These signatures consume 
22 bytes for every roughly 4 KB of original data, so the 
signatures are about 0.5% of the size of the data to back 
up. 
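A minimal sketch of the signature computation follows. It swaps in a simple polynomial rolling hash for true Rabin fingerprints, and the window width, hash constants, and function names are assumptions made for illustration. The maximum chunk size is set one byte under 64 KB so that the length fits in the 2-byte field; how the prototype handles that corner is not stated in the text. With 22 bytes per chunk and chunks of roughly 4 KB, the overhead is 22/4096, or about 0.5%.

```python
import hashlib
import os

MIN_CHUNK = 2 * 1024          # 2 KB minimum, as in the text
MAX_CHUNK = 64 * 1024 - 1     # just under 64 KB so the length fits in 2 bytes (assumption)
MASK = 4096 - 1               # cut when the low 12 hash bits are zero
WINDOW = 48                   # assumed sliding-window width in bytes
BASE = 257
MOD = (1 << 61) - 1
BASE_POW = pow(BASE, WINDOW - 1, MOD)


def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of non-overlapping content-defined chunks."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        pos = i - start
        if pos >= WINDOW:
            # Drop the byte that slides out of the window.
            h = (h - data[i - WINDOW] * BASE_POW) % MOD
        h = (h * BASE + byte) % MOD
        length = pos + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)


def chunk_signatures(block: bytes):
    """Return one 22-byte record per chunk: 20-byte SHA-1 digest + 2-byte length."""
    records = []
    for start, end in chunk_boundaries(block):
        digest = hashlib.sha1(block[start:end]).digest()
        records.append(digest + (end - start).to_bytes(2, "big"))
    return records


block = os.urandom(256 * 1024)
sigs = chunk_signatures(block)
overhead = 22 * len(sigs) / len(block)   # fraction of the data size spent on signatures
```

Because a boundary depends only on the bytes inside the sliding window, an insertion early in a file shifts later chunk offsets without changing most chunk contents, which is what makes the signatures useful for finding unchanged data.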

Unlike LBFS, we do not create a global index of 
chunk hashes—to limit overhead, we do not attempt to 
find common data between different files. When a file 
changes, we limit the search for unmodified data to the 
chunks in the previous version of the file. Cumulus com- 
putes chunk signatures for the new file data, and matches 
with old chunks are written as a reference to the old data. 
New chunks are written out to a new object. However, 
Cumulus could be extended to perform global data de- 
duplication while maintaining backup format compati- 
bility. 
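The per-file matching described above might look roughly like this. The `signature` and `encode_new_version` names and the `("ref", ...)`/`("new", ...)` plan entries are hypothetical; in the actual format a match would be written as a reference into an existing object rather than as a bare signature.

```python
import hashlib


def signature(chunk: bytes) -> bytes:
    """20-byte SHA-1 digest followed by a 2-byte length."""
    return hashlib.sha1(chunk).digest() + len(chunk).to_bytes(2, "big")


def encode_new_version(new_chunks, old_signatures):
    """Plan how to store a new file version.

    old_signatures holds signatures from the previous version of the same
    file only; no global index across files is consulted.
    """
    known = set(old_signatures)
    plan = []
    for chunk in new_chunks:
        sig = signature(chunk)
        if sig in known:
            plan.append(("ref", sig))      # unchanged: refer to the old data
        else:
            plan.append(("new", chunk))    # changed: write to a new object
    return plan


old_chunks = [b"header" * 400, b"body-v1" * 400, b"footer" * 400]
new_chunks = [b"header" * 400, b"body-v2" * 400, b"footer" * 400]
plan = encode_new_version(new_chunks, [signature(c) for c in old_chunks])
kinds = [kind for kind, _ in plan]
```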


4.4 Segment Filtering and Storage 


The core Cumulus backup implementation is only capa- 
ble of writing segments as uncompressed TAR files to 
local disk. Additional functionality is implemented by 
calling out to external scripts. 

When performing a backup, all segment data may be 
filtered through a specified command before writing it. 
Specifying a program such as gzip can provide com- 
pression, or gpg can provide encryption. 
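The filtering step amounts to piping segment bytes through an external command. The sketch below is an assumption about how such a hook could be written, not the prototype's code; `filter_segment` is a hypothetical name, and a small Python subprocess stands in for the real filter so the example does not depend on any particular tool being installed.

```python
import gzip
import subprocess
import sys


def filter_segment(data: bytes, command: list) -> bytes:
    """Pipe segment bytes through an external filter and return its output."""
    result = subprocess.run(command, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout


# Stand-in filter: a child Python process that gzip-compresses stdin to stdout.
# A real configuration would name a compression or encryption program instead.
GZIP_FILTER = [
    sys.executable, "-c",
    "import sys, gzip; sys.stdout.buffer.write(gzip.compress(sys.stdin.buffer.read()))",
]

segment = b"cumulus segment data " * 1000
compressed = filter_segment(segment, GZIP_FILTER)
roundtrip = gzip.decompress(compressed)
```

Keeping the filter outside the core program means the backup format itself does not need to know which compression or encryption scheme is in use.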

Similarly, network protocols are implemented by call- 
ing out to external scripts. Cumulus first writes segments 
to a temporary directory, then calls an upload script to 
transfer them in the background while the main backup 
process continues. Slow uploads will eventually throttle 
the backup process so that the required temporary stor- 
age space is bounded. Upload scripts may be quite sim- 
ple; a script for uploading to Amazon S3 is merely 12 
lines long in Python using the boto [4] library. 
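The throttling behavior described above can be illustrated with a bounded queue. This is a sketch under assumptions: `run_backup`, `max_pending`, and the in-memory `upload` callback are hypothetical, and a real upload script would transfer files from the temporary directory instead.

```python
import queue
import threading


def run_backup(segments, upload, max_pending=2):
    """Produce segments while a background thread uploads them.

    The bounded queue makes put() block once max_pending segments are
    waiting, so a slow upload throttles the producer and bounds the
    temporary storage in use.
    """
    pending = queue.Queue(maxsize=max_pending)

    def uploader():
        while True:
            item = pending.get()
            if item is None:       # sentinel: no more segments
                break
            upload(item)

    worker = threading.Thread(target=uploader)
    worker.start()
    for seg in segments:
        pending.put(seg)           # blocks while the queue is full
    pending.put(None)
    worker.join()


uploaded = []
run_backup([f"segment-{i}" for i in range(5)], uploaded.append)
```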


4.5 Snapshot Restores 


The Cumulus utility tool implements complete restore 
functionality. This tool can automatically decompress 
and extract objects from segments, and can efficiently 
extract just a subset of files from a snapshot. 

To reduce disk space requirements, the restore tool 
downloads segments as needed instead of all at once 
at the start, and can delete downloaded segments as it 
goes along. The restore tool downloads the snapshot de- 
scriptor first, followed by the metadata. The backup tool 
segregates data and metadata into separate segments, so 
this phase does not download any file data. Then file 
contents are restored: guided by the metadata, data from 
each segment is restored as that segment is downloaded. 
For partial restores, only the necessary segments are 
downloaded. 

Currently, in the restore tool it is possible that a seg- 
ment may be downloaded multiple times if blocks for 
some files are spread across many segments. However, 
this situation is merely an implementation issue and can 
be fixed by restoring data for these files non-sequentially 
as it is downloaded. 

Finally, Cumulus includes a FUSE [10] interface that 
allows a collection of backup snapshots to be mounted as 
a virtual filesystem on Linux, thereby providing random 
access with standard filesystem tools. This interface re- 
lies on the fact that file metadata is stored in sorted order 
by filename, so a binary search can quickly locate any 
specified file within the metadata log. 
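The lookup that sorted metadata makes possible can be sketched with a standard binary search. The in-memory lists and the `find_entry` helper are assumptions for illustration; the actual metadata log is a stored text format rather than a Python list.

```python
import bisect


def find_entry(sorted_names, metadata, target):
    """Locate target in a metadata log whose entries are sorted by filename."""
    i = bisect.bisect_left(sorted_names, target)
    if i < len(sorted_names) and sorted_names[i] == target:
        return metadata[i]
    return None


names = ["bin/ls", "etc/hosts", "home/a/notes.txt", "home/a/todo.txt"]
meta = [{"name": n, "size": 100 + k} for k, n in enumerate(names)]
hit = find_entry(names, meta, "home/a/notes.txt")
miss = find_entry(names, meta, "home/b/none")
```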


5 Evaluation 


We use both trace-based simulation and a prototype im- 
plementation to evaluate the use of thin cloud services 
for remote backup. Our goal is to answer three high-level 
sets of questions: 


• What is the penalty of using a thin cloud service 
with a very simple storage interface compared to a 
more sophisticated service? 


• What are the monetary costs for using remote 
backup for two typical usage scenarios? How 
should remote backup strategies adapt to minimize 
monetary costs as the ratio of network and storage 
prices varies? 


• How does our prototype implementation compare 
with other backup systems? What are the additional 
benefits (e.g., compression, sub-file incrementals) 
and overheads (e.g., metadata) of an implementa- 
tion not captured in simulation? What is the perfor- 
mance of using an online service like Amazon S3 
for backup? 


The following evaluation sections answer these ques- 
tions, beginning with a description of the trace workloads 
we use as inputs to the experiments. 


5.1 Trace Workloads 


We use two traces as workloads to drive our evaluations. 
A fileserver trace tracks all files stored on our research 
group fileserver, and models the use of a cloud service 
for remote backup in an enterprise setting. A user trace 
is taken from the Cumulus backups of the home directory 
of one of the authors’ personal computers, and models 
the use of remote backup in a home setting. The traces 
contain a daily record of the metadata of all files in each 
setting, including a hash of the file contents. The user 
                      Fileserver     User 
Duration (days)       157            223 
Entries               26673083       122007 
Files                 24344167       116426 
File Sizes 
  Median              0.996 KB       4.4 KB 
  Average             153 KB         21.4 KB 
  Maximum             54.1 GB        169 MB 
  Total               3.47 TB        2.37 GB 
Update Rates 
  New data/day        9.50 GB        10.3 MB 
  Changed data/day    805 MB         29.9 MB 
  Total data/day      10.3 GB        40.2 MB 


Table 2: Key statistics of the two traces used in the eval- 
uations. File counts and sizes are for the last day in the 
trace. “Entries” is files plus directories, symlinks, etc. 


trace further includes complete backups of all file data, 
and enables evaluation of the effects of compression and 
sub-file incrementals. Table 2 summarizes the key statis- 
tics of each trace. 


5.2 Remote Backup to a Thin Cloud 


First we explore the overhead of using remote backup to 
a thin cloud service that has only a simple storage inter- 
face. We compare this thin service model to an “optimal” 
model representing more sophisticated backup systems. 

We use simulation for these experiments, and start by 
describing our simulator. We then define our optimal 
baseline model and evaluate the overhead of using a sim- 
ple interface relative to a more sophisticated system. 


5.2.1 Cumulus Simulator 


The Cumulus simulator models the process of backing 
up collections of files to a remote backup service. It uses 
traces of daily records of file metadata to perform back- 
ups by determining which files have changed, aggregat- 
ing changed file data into segments for storage on a re- 
mote service, and cleaning expired data as described in 
Section 3. We use a simulator, rather than our prototype, 
because a parameter sweep of the space of cleaning pa- 
rameters on datasets as large as our traces is not feasible 
in a reasonable amount of time. 

The simulator tracks three overheads associated with 
performing backups. It tracks storage overhead, or the 
total number of bytes to store a set of snapshots com- 
puted as the sum of the size of each segment needed. 
Storage overhead includes both actual file data as well as 
wasted space within segments. It tracks network over- 
head, the total data that must be transferred over the net- 
work to accomplish a backup. On graphs, we show this 
overhead as a cumulative value: the total data transferred 
from the beginning of the simulation until the given day. 
Since remote backup services have per-file charges, the 
simulator also tracks segment overhead as the number of 
segments created during the process of making backups. 

The simulator also models two snapshot scenarios. 
In the single snapshot scenario, the simulator maintains 
only one snapshot remotely and it deletes all previous 
snapshots. In the multiple snapshot scenario, the sim- 
ulator retains snapshots according to a pre-determined 
backup schedule. In our experiments, we keep the most 
recent seven daily snapshots, with additional weekly 
snapshots retained going back farther in time so that a 
total of 12 snapshots are kept. This schedule emulates 
the backup policy an enterprise might employ. 
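One way to express the multiple-snapshot schedule is sketched below. The text specifies seven daily snapshots and twelve in total; the rule that weekly snapshots are those taken on days divisible by seven is an assumption of this sketch, as is the `retained_snapshots` name.

```python
def retained_snapshots(today, daily=7, total=12, week=7):
    """Return the sorted day numbers whose snapshots are kept on a given day."""
    # The most recent `daily` snapshots are always kept.
    keep = [d for d in range(today, today - daily, -1) if d >= 0]
    # Older snapshots are kept only if they fall on a weekly boundary
    # (assumed rule: day % week == 0), until `total` snapshots are retained.
    d = today - daily
    while d >= 0 and len(keep) < total:
        if d % week == 0:
            keep.append(d)
        d -= 1
    return sorted(keep)


kept = retained_snapshots(today=100)
```

Anchoring weekly snapshots to fixed days keeps the schedule self-consistent: every snapshot retained on one day was also retained the day before, so nothing has to be recreated.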

The simulator makes some simplifying assumptions 
that we explore later when evaluating our implementa- 
tion. The simulator detects changes to files in the traces 
using a per-file hash. Thus, the simulator cannot detect 
changes to only a portion of a file, and assumes that 
the entire file is changed. The simulator also does not 
model compression or metadata. We account for sub- 
file changes, compression, and metadata overhead when 
evaluating the prototype in Section 5.4. 


5.2.2 Optimal Baseline 


A simple storage interface for remote backup can incur 
an overhead penalty relative to more sophisticated ap- 
proaches. To quantify the overhead of this approach, we 
use an idealized optimal backup as a basis of comparison. 

For our simulations, the optimal backup is one in 
which no more data is stored or transferred over the net- 
work than is needed. Since simulation is done at a file 
granularity, the optimal backup will transfer the entire 
contents of a file if any part changes. Optimal backup 
will, however, perform data de-duplication at a file level, 
storing only one copy if multiple files have the same hash 
value. In the optimal backup, no space is lost to frag- 
mentation when deleting old snapshots. Cumulus could 
achieve this optimal performance in this simulation by 
storing each file in a separate segment—that is, to never 
bundle files together into larger segments. As discussed 
in Section 3.2 and as our simulation results show, though, 
there are good reasons to use segments with sizes larger 
than the average file. 
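The optimal baseline for the single-snapshot scenario can be sketched as follows. The trace representation (one dictionary per day mapping a path to a content hash and size) and the `optimal_costs` name are assumptions for illustration, not the simulator's actual data structures.

```python
def optimal_costs(daily_snapshots):
    """Yield (stored_bytes, uploaded_bytes) per day, retaining one snapshot.

    daily_snapshots: one dict per day mapping path -> (content_hash, size).
    """
    previous = set()
    for snapshot in daily_snapshots:
        # File-level de-duplication: one copy per distinct content hash.
        unique = {h: size for h, size in snapshot.values()}
        stored = sum(unique.values())
        # Only content not already stored remotely needs to be uploaded;
        # any change to a file yields a new hash and a full-file transfer.
        uploaded = sum(size for h, size in unique.items() if h not in previous)
        previous = set(unique)
        yield stored, uploaded


days = [
    {"a.txt": ("h1", 100), "b.txt": ("h2", 50)},
    {"a.txt": ("h1", 100), "b.txt": ("h3", 60), "copy-of-a.txt": ("h1", 100)},
]
results = list(optimal_costs(days))
```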

As an example of these costs and how we measure 
them, Figure 2(a) shows the optimal storage and upload 
overheads for daily backups of the 223 days of the user 
trace. In this simulation, only a single snapshot is re- 
tained each day. Storage grows slowly in proportion to 
the amount of data in a snapshot, and the cumulative net- 
work transfer grows linearly over time. 

Figure 2(b) shows the results of two simulations of Cu- 
mulus backing up the same data. The graph shows the 
overheads relative to optimal backup; a backup as good 
as optimal would have 0% relative overhead. These re- 
Figure 2: (a) Storage and network overhead for an optimal backup of the files from the user trace. (b) Overheads with 
and without cleaning; segments are cleaned at 60% utilization. Only storage overheads are shown for the no-cleaning 
case since there is no network transfer overhead without cleaning. 


sults clearly demonstrate the need for cleaning when us- 
ing a simple storage interface for backup. When seg- 
ments are not cleaned (only deleting segments that by 
chance happen to be entirely no longer needed), wasted 
storage space grows quickly with time—by the end of 
the simulation at day 223, the size of a snapshot is nearly 
double the required size. In contrast, when segments 
are marked for cleaning at the 60% utilization thresh- 
old, storage overhead quickly stabilizes below 10%. The 
overhead in extra network transfers is similarly modest. 


5.2.3 Cleaning Policies 


Cleaning is clearly necessary for efficient backup, but it 
is also parameterized by two metrics: the size of the seg- 
ments used for aggregation, transfer, and storage (Sec- 
tion 3.2), and the threshold at which segments will be 
cleaned (Section 3.5). In our next set of experiments, 
we explore the parameter space to quantify the impact of 
these two metrics on backup performance. 

Figures 3 and 4 show the simulated overheads of 
backup with Cumulus using the fileserver and user 
traces, respectively. The figures show both relative over- 
heads to optimal backup (left y-axis) as well as the abso- 
lute overheads (right y-axis). We use the backup policy 
of multiple daily and weekly snapshots as described in 
Section 5.2.1. The figures show cleaning overhead for a 
range of cleaning thresholds and segment sizes. Each fig- 
ure has three graphs corresponding to the three overheads 
of remote backup to an online service. Average daily 
storage shows the average storage requirements per day 
over the duration of the simulation; this value is the total 
storage needed for tracking multiple backup snapshots, 
not just the size of a single snapshot. Similarly, average 
daily upload is the average of the data transferred each 
day of the simulation, excluding the first; we exclude the 
first day since any backup approach must transfer the en- 
tire initial filesystem. Finally, average segments per day 
tracks the number of new segments uploaded each day to 
account for per-file upload and storage costs. 


Storage and upload overheads improve with decreas- 
ing segment size, but at small segment sizes (< 1 MB) 
backups require very large numbers of segments and 
limit the benefits of aggregating file data (Section 3.2). 
As expected, increasing the cleaning threshold increases 
the network upload overhead. Storage overhead with 
multiple snapshots, however, has an optimum cleaning 
threshold value. Increasing the threshold initially de- 
creases storage overhead, but high thresholds increase it 
again; we explore this behavior further below. 

Both the fileserver and user workloads exhibit simi- 
lar sensitivities to cleaning thresholds and segment sizes. 
The user workload has higher overheads relative to op- 
timal due to smaller average files and more churn in the 
file data, but overall the overhead penalties remain low. 

Figures 3(a) and 4(a) show that there is a cleaning 
threshold that minimizes storage overheads. Increasing 
the cleaning threshold intuitively reduces storage over- 
head relative to optimal since the more aggressive clean- 
ing at higher thresholds will delete wasted space in seg- 
ments and thereby reduce storage requirements. 

Figure 5 explains why storage overhead increases 
again at higher cleaning thresholds. It shows three 
curves, the 16 MB segment size curve from Figure 3(a) 
and two curves that decompose the storage overhead into 
individual components (Section 3.5). One is overhead 
due to duplicate copies of data stored over time in the 
cleaning process; cleaning at lower thresholds reduces 
this component. The other is due to wasted space in seg- 
ments which have not been cleaned; cleaning at higher 
thresholds reduces this component. A cleaning threshold 
near the middle, however, minimizes the sum of both of 
these overheads. 


5.3 Paying for Remote Backup 


The evaluation in the previous section measured the over- 
head of Cumulus in terms of storage, network, and seg- 
ment resource usage. Remote backup as a service, how- 
ever, comes at a price. In this section, we calculate 
Figure 5: Detailed breakdown of storage overhead when 
using a 16 MB segment size for the fileserver workload. 


Table 3: Costs for backups in US dollars, if performed 
optimally, for the fileserver and user traces using current 
prices for Amazon S3. 


monetary costs for our two workload models, evaluate 
cleaning threshold and segment size in terms of costs 
instead of resource usage, and explore how cleaning should 
adapt to minimize costs as the ratio of network and storage 
prices varies. While similar, there are differences between 
this problem and the usual evaluation of cleaning 
policies for a log-structured file system: instead 
of a fixed disk size and a goal to minimize I/O, we have 
no fixed limits but want to minimize monetary cost. 


We use the prices for Amazon S3 as an initial point in 
the pricing space. As of January 2009, these prices are 
(in US dollars): 

Storage: $0.15 per GB·month 
Upload: $0.10 per GB 
Segment: $0.01 per 1000 files uploaded 
With this pricing model, the segment cost for uploading 
an empty file is equivalent to the bandwidth cost of 
uploading approximately 100 KB of data; that is, when 
uploading 100 KB files, half of the cost is for the bandwidth 
and half for the upload request itself. As the file 
size increases, the per-request component becomes a 
progressively smaller part of the total cost. 
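The break-even point can be checked with a few lines of arithmetic. The prices are those quoted above; the decimal definition of a gigabyte and the function names are assumptions of this sketch.

```python
STORAGE_PER_GB_MONTH = 0.15   # US dollars, January 2009 prices quoted above
UPLOAD_PER_GB = 0.10
PER_FILE = 0.01 / 1000        # $0.01 per 1000 files uploaded

GB = 10**9                    # decimal gigabyte assumed for round numbers


def upload_cost(size_bytes):
    """Return (bandwidth_cost, request_cost) in dollars for uploading one file."""
    return size_bytes / GB * UPLOAD_PER_GB, PER_FILE


def break_even_bytes():
    """File size at which the bandwidth cost equals the per-request cost."""
    return PER_FILE / UPLOAD_PER_GB * GB


def request_share(size_bytes):
    """Fraction of a file's upload cost that is due to the request itself."""
    bandwidth, request = upload_cost(size_bytes)
    return request / (bandwidth + request)


size = break_even_bytes()                      # about 100 KB
storage_to_network = STORAGE_PER_GB_MONTH / UPLOAD_PER_GB
```

Evaluating `request_share` at a few sizes shows how quickly the per-request charge fades as files are aggregated into megabyte-scale segments.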


Neglecting for the moment the segment upload costs, 
Table 3 shows the monthly storage and upload costs for 
each of the two traces. Storage costs dominate ongo- 
ing costs. They account for about 95% and 78% of the 
Figure 6: Costs in US dollars for backups of the fileserver 
trace, assuming Amazon S3 prices. Costs for the user trace 
differ in absolute values but are qualitatively similar. 


monthly costs for the fileserver and user traces, respec- 
tively. Thus, changes to the storage efficiency will have 
a more substantial effect on total cost than changes in 
bandwidth efficiency. We also note that the absolute 
costs for the home backup scenario are very low, indi- 
cating that Amazon’s pricing model is potentially quite 
reasonable for consumers: even for home users with an 
order of magnitude more data to back up than our user 
workload, yearly ongoing costs are roughly US$50. 


Whereas Figure 3 explored the parameter space of 
cleaning thresholds and segment sizes in terms of resource 
overhead, Figure 6 shows results in terms of overall 
cost for backing up the fileserver trace. These results 
show that using a simple storage interface for remote 
backup also incurs very little additional monetary cost 
relative to optimal backup: 0.5–2% for the fileserver 
trace depending on the parameters, and as low as about 
5% in the user trace. 


When evaluated in terms of monetary costs, though, 
the choices of cleaning parameters change compared to 
the parameters in terms of resource usage. The cleaning 
threshold providing the minimum cost is smaller and less 
aggressive (threshold = 0.4) than in terms of resource 
usage (threshold = 0.6). However, since overhead is not 
overly sensitive to the cleaning threshold, Cumulus still 
provides good performance even if the cleaning threshold 
is not tuned optimally. Furthermore, in contrast to 
resource usage, decreasing segment size does not always 
decrease overall cost. At some point (in this case between 
1 and 4 MB) decreasing segment size increases overall 
cost due to the per-file pricing. We do not evaluate 
segment sizes less than 1 MB for the fileserver trace 
since, by 1 MB, smaller segments are already a loss. The 
results for the user workload, although not shown, are 
qualitatively similar, with a segment size of 0.5 MB to 
1 MB best. 


The pricing model of Amazon S3 is only one point 
in the pricing space. As a final cost experiment, we ex- 
Figure 7: How the optimal threshold for cleaning 
changes as the relative cost of storage vs. network varies. 


plore how cleaning should adapt to changes in the rel- 
ative price of storage versus network. Figure 7 shows 
the optimal cleaning threshold for the fileserver and user 
workloads as a function of the ratio of storage to net- 
work cost. The storage to network ratio measures the 
relative cost of storing a gigabyte of data for a month 
and uploading a gigabyte of data. Amazon S3 has a ra- 
tio of 1.5. In general, as the cost of storage increases, 
it becomes advantageous to clean more aggressively (the 
optimal cleaning threshold increases). The ideal threshold 
stabilizes around 0.5–0.6 when storage is at least ten 
times more expensive than network upload, since clean- 
ing too aggressively will tend to increase storage costs. 


5.4 Prototype Evaluation 


In our final set of experiments, we compare the overhead 
of the Cumulus prototype implementation with other 
backup systems. We also evaluate the sensitivity of com- 
pression on segment size, the overhead of metadata in 
the implementation, the performance of sub-file incre- 
mentals and restores, and the time it takes to upload data 
to a remote service like Amazon S3. 


5.4.1 System Comparisons 


First, we provide some results from running our Cumulus 
prototype and compare with two existing backup tools 
that also target Amazon S3: Jungle Disk and Brackup. 
We use the complete file contents included in the user 
trace to accurately measure the behavior of our full Cu- 
mulus prototype and other real backup systems. For each 
day in the first three months of the user trace, we extract 
a full snapshot of all files, then back up these files with 
each of the backup tools. We compute the average cost, 
per month, broken down into storage, upload bandwidth, 
and operation count (files created or modified). 

We configured Cumulus to clean segments with less 
than 60% utilization on a weekly basis. We evaluate 
Brackup with two different settings. The first 
uses the merge_files_under=1kB option to only 
aggregate files if those files are under 1 KB in size 
System                 Storage      Upload      Operations 
Jungle Disk            ≈ 2 GB       1.26 GB     30000 
                       $0.30        $0.126      $0.30 
Brackup (default)      1.340 GB     0.760 GB    9027 
                       $0.201       $0.076      $0.090 
Brackup (aggregated)   1.353 GB     0.713 GB    1403 
                       $0.203       $0.071      $0.014 
Cumulus                1.264 GB     0.465 GB    419 
                       $0.190       $0.047      $0.004 


Table 4: Cost comparison for backups based on replaying 
actual file changes in the user trace over a three month 
period. Costs for Cumulus are lower than those shown in 
Table 3 since that evaluation ignored the possible bene- 
fits of compression and sub-file incrementals, which are 
captured here. Values are listed on a per-month basis. 


(this setting is recommended). Since this setting still 
results in many small files (many of the small files 
are still larger than 1 KB), a “high aggregation” run 
sets merge_files_under=16kB to capture most of 
the small files and further reduce the operation count. 
Brackup includes the digest database in the files backed 
up, which serves a role similar to the database Cumulus 
stores locally. For fairness in the comparison, we sub- 
tract the size of the digest database from the sizes re- 
ported for Brackup. 

Both Brackup and Cumulus use gpg to encrypt data 
in the test; gpg compresses the data with gzip prior to 
encryption. Encryption is enabled in Jungle Disk, but no 
compression is available. 

In principle, we would expect backups with Jungle 
Disk to be near optimal in terms of storage and upload 
since no space is wasted due to aggregation. But, as a 
tradeoff, Jungle Disk will have a much higher operation 
count. In practice, Jungle Disk will also suffer from a 
lack of de-duplication, sub-file incrementals, and com- 
pression. 

Table 4 compares the estimated backup costs for Cu- 
mulus with Jungle Disk and Brackup. Several key points 
stand out in the comparison: 


• Storage and upload requirements for Jungle Disk 
are larger, owing primarily to the lack of compression. 


• Except in the high aggregation case, both Brackup 
and Jungle Disk incur a large cost due to the many 
small files stored to S3. The per-file cost for uploads 
is larger than the per-byte cost, and for Jungle Disk 
significantly so. 


• Brackup stores a complete copy of all file metadata 
with each snapshot, which in total accounts for 150–200 
MB/month of the upload cost. The cost in Cumulus 
is lower since Cumulus can re-use metadata. 


Comparing storage requirements of Cumulus with the 
average size of a full backup with the venerable tar 
utility, both are within 1%: storage overhead in Cumu- 
lus is roughly balanced out by gains achieved from de- 
duplication. Using duplicity as a proxy for near-optimal 
incremental backups, in a test with two months from the 
user trace Cumulus uploads only about 8% more data 
than is needed. Without sub-file incrementals in Cumu- 
lus, the figure is closer to 33%. 

The Cumulus prototype thus shows that a service with 
a simple storage interface can achieve low overhead, and 
that Cumulus can achieve a lower total cost than other 
existing backup tools targeting S3. 

While perhaps none of the systems are yet optimized 
for speed, initial full backups in Brackup and Jungle Disk 
were both notably slow. In the tests, the initial Jungle 
Disk backup took over six hours, Brackup (to local disk, 
not S3) took slightly over two hours, and Cumulus (to 
S3) approximately 15 minutes. For comparison, simply 
archiving all files with tar to local disk took approxi- 
mately 10 minutes. 

For incremental backups, elapsed times for the tools 
were much more comparable. Jungle Disk averaged 248 
seconds per run archiving to S3. Brackup averaged 115 
seconds per run and Cumulus 167 seconds, but in these 
tests each was storing snapshots to local disk rather than 
to Amazon S3. 


5.4.2 Segment Compression 


Next we isolate the effectiveness of compression at re- 
ducing the size of the data to back up, particularly as a 
function of segment size and related settings. We used 
as a sample the full data contained in the first day of the 
user trace: the uncompressed size is 1916 MB, the com- 
pressed tar size is 1152 MB (factor of 1.66), and files 
individually compressed total 1219 MB (1.57x), 5.8% 
larger than whole-snapshot compression. 

When aggregating data together into segments, we 
found that larger input segment sizes yielded better 
compression, up to about 300 KB when using gzip and 
1–2 MB for bzip2, where compression ratios leveled off. 
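The effect of aggregation on compression can be reproduced in miniature. The sketch below uses Python's zlib module (the DEFLATE algorithm that gzip also uses) on synthetic data; the helper names, the sample files, and the 64 KB segment size are assumptions for illustration and do not reproduce the measurements above.

```python
import zlib


def individually_compressed(files):
    """Total size when every file is compressed on its own."""
    return sum(len(zlib.compress(f)) for f in files)


def aggregated_compressed(files, segment_size):
    """Total size when files are packed into segments before compression."""
    total, current, current_len = 0, [], 0
    for f in files:
        current.append(f)
        current_len += len(f)
        if current_len >= segment_size:
            total += len(zlib.compress(b"".join(current)))
            current, current_len = [], 0
    if current:
        total += len(zlib.compress(b"".join(current)))
    return total


# Many small, similar files: the case where aggregation helps most.
files = [(f"user={i}; shell=/bin/sh; home=/home/u{i}\n" * 20).encode()
         for i in range(200)]
separate = individually_compressed(files)
together = aggregated_compressed(files, segment_size=64 * 1024)
```

Compressing small files separately pays a fixed per-stream cost each time and cannot exploit redundancy between files, which is why packing them into segments first tends to give smaller output.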


5.4.3 Metadata 


The Cumulus prototype stores metadata for each file in a 
backup snapshot in a text format, but after compression 
the format is still quite efficient. In the full tests on the 
user trace, the metadata for a full backup takes roughly 
46 bytes per item backed up. Since most items include a 
20-byte hash value which is unlikely to be compressible, 
the non-checksum components of the metadata average 
under 30 bytes per file. 
                    File A       File B 

File size           4.860 MB     5.890 MB 
Compressed size     1.547 MB     2.396 MB 
Cumulus size        5.190 MB     3.081 MB 
Size overhead       235%         29% 

rdiff delta         1.421 MB     122 KB 
Cumulus delta       1.527 MB     181 KB 
Delta overhead      7%           48% 





Table 5: Comparison of Cumulus sub-file incrementals 
with an idealized system based on rdiff, evaluated on two 
sample files from the user trace. 


Metadata logs can be stored incrementally: new snap- 
shots can reference the portions of old metadata logs that 
are not modified. In the full user trace replay, a full 
metadata log was written to a snapshot weekly. On days 
where only differences were written out, though, the av- 
erage metadata log delta was under 2% of the size of a 
full metadata log. Overall, across all the snapshots taken, 
the data written out for file metadata was approximately 
5% of the total size of the file data itself. 


5.4.4 Sub-File Incrementals 


To evaluate the support for sub-file incrementals in Cu- 
mulus, we make use of files extracted from the user trace 
that are frequently modified in place. We extract files 
from a 30-day period at the start of the trace. File A is 
a frequently-updated Bayesian spam filtering database, 
about 90% of which changes daily. File B records the 
state for a file-synchronization tool (unison), of which an 
average of 5% changes each day—however, unchanged 
content may still shift to different byte offsets within the 
file. While these samples do not capture all behavior, 
they do represent two distinct and notable classes of sub- 
file updates. 

To provide a point of comparison, we use rdiff [14] 
to generate an rsync-style delta between consecutive file 
versions. Table 5 summarizes the results. 

The size overhead measures the storage cost of sub- 
file incrementals in Cumulus. To reconstruct the latest 
version of a file, Cumulus might need to read data from 
many past versions, though cleaning will try to keep this 
bounded. This overhead compares the average size of 
a daily snapshot (“Cumulus size”) against the average 
compressed size of the file backed up. As file churn 
increases, overhead tends to increase. 

The delta overhead compares the data that must be 
uploaded daily by Cumulus (“Cumulus delta”) against the 
average size of patches generated by rdiff (“rdiff delta”). 
When only a small portion of the file changes each day 
(File B), rdiff is more efficient than Cumulus in repre- 
senting the changes. However, sub-file incrementals are 
still a large win for Cumulus, as the size of the incrementals 
is still much smaller than a full copy of the file. 
When large parts of the file change daily (File A), the 
efficiency of Cumulus approaches that of rdiff. 
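Both overhead figures are relative differences against a baseline, which a short calculation makes explicit. The `overhead_percent` helper is a hypothetical name; the inputs are the File B values reported in Table 5.

```python
def overhead_percent(measured, baseline):
    """Relative overhead of measured over baseline, in percent."""
    return (measured / baseline - 1.0) * 100.0


# File B values from Table 5 (sizes in MB, deltas in KB).
size_overhead_b = overhead_percent(3.081, 2.396)   # Cumulus size vs. compressed size
delta_overhead_b = overhead_percent(181, 122)      # Cumulus delta vs. rdiff delta
```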


5.4.5 Upload Time 


As a final experiment, we consider the time to upload 
to a remote storage service. Our Cumulus prototype is 
capable of uploading snapshot data directly to Amazon 
S3. To simplify matters, we evaluate upload time in iso- 
lation, rather than as part of a full backup, to provide a 
more controlled environment. Cumulus uses the boto [4] 
Python library to interface with S3. 

As our measurements are from one experiment from a 
single computer (on a university campus network), they 
should not be taken as a good measure of the overall per- 
formance of S3. For large files—a megabyte or larger— 
uploads proceed at a maximum rate of about 800 KB/s. 
According to our results there is an overhead equivalent 
to a latency of roughly 100 ms per upload, and for small 
files this dominates the actual time for data transfer. It 
is thus advantageous to upload data in larger segments, 
as Cumulus does. More recent tests indicate that speeds 
may have improved. 

The S3 protocol, based on HTTP, does not support 
pipelining multiple upload requests. Multiple uploads 
in parallel could reduce overhead somewhat. Still, it re- 
mains beneficial to perform uploads in larger units. 

For perspective, assuming the maximum transfer rates 
above, ongoing backups for the fileserver and user work- 
loads will take on average 3.75 hours and under a minute, 
respectively. Overheads from cleaning will increase 
these times, but since network overheads from cleaning 
are generally small, these upload times will not change 
by much. For these two workloads, backup times are 
very reasonable for daily snapshots. 
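The upload-time estimates follow from the measured rate and per-upload latency. The sketch below assumes binary units (1 GB = 1024 × 1024 KB) and uses hypothetical names; the daily volumes are the "Total data/day" values from Table 2.

```python
RATE_KB_PER_S = 800.0        # peak upload rate reported above
PER_UPLOAD_LATENCY_S = 0.1   # roughly 100 ms of overhead per upload


def upload_seconds(total_kb, segment_kb):
    """Estimated time to upload total_kb of data in segments of segment_kb."""
    uploads = -(-total_kb // segment_kb)   # ceiling division
    return uploads * PER_UPLOAD_LATENCY_S + total_kb / RATE_KB_PER_S


fileserver_kb = 10.3 * 1024 * 1024   # 10.3 GB/day, binary units assumed
user_kb = 40.2 * 1024                # 40.2 MB/day

# Transfer time alone, ignoring per-upload latency.
fileserver_hours = fileserver_kb / RATE_KB_PER_S / 3600
user_seconds = user_kb / RATE_KB_PER_S

# Per-upload latency dominates when the same data is sent in tiny pieces.
small_pieces = upload_seconds(user_kb, segment_kb=4)
large_segments = upload_seconds(user_kb, segment_kb=4096)
```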


5.4.6 Restore Time 


To completely restore all data from one of the user snap- 
shots takes approximately 11 minutes, comparable to but 
slightly faster than the time required for an initial full 
backup. 

When restoring individual files from the user 
dataset, almost all time is spent extracting and parsing 
metadata—there is a fixed cost of approximately 24 sec- 
onds to parse the metadata to locate requested files. Ex- 
tracting requested files is relatively quick, under a second 
for small files. 

Both restore tests were done from local disk; restoring 
from S3 will be slower by the time needed to download 
the data. 


6 Conclusions 


It is fairly clear that the market for Internet-hosted 
backup service is growing. However, it remains unclear 
what form of this service will dominate. On one hand, 
it is in the natural interest of service providers to pack- 
age backup as an integrated service since that will both 
create a “stickier” relationship with the customer and al- 
low higher fees to be charged as a result. On the other 
hand, given our results, the customer’s interest may be 
maximized via an open market for commodity storage 
services (such as S3), increasing competition due to the 
low barrier to switching providers, and thus driving down 
prices. Indeed, even today integrated backup providers 
charge between $5 and $10 per month per user while the 
S3 charges for backing up our test user using the Cu- 
mulus system were only $0.24 per month. (For example, 
Symantec’s Protection Network charges $9.99 per month 
for 10GB of storage and EMC’s MozyPro service costs 
$3.95 + $0.50/GB per month per desktop.) 

Moreover, a thin-cloud approach to backup allows one 
to easily hedge against provider failures by backing up 
to multiple providers. This strategy may be particu- 
larly critical for guarding against business risk—a lesson 
that has been learned the hard way by customers whose 
hosting companies have gone out of business. Provid- 
ing the same hedge using the integrated approach would 
require running multiple backup systems in parallel on 
each desktop or server, incurring redundant overheads 
(e.g., scanning, compression, etc.) that will only increase 
as disk capacities grow. 

Finally, while this paper has focused on an admit- 
tedly simple application, we believe it identifies a key 
research agenda influencing the future of “cloud com- 
puting”: can one build a competitive product economy 
around a cloud of abstract commodity resources, or do 
underlying technical reasons ultimately favor an inte- 
grated service-oriented infrastructure? 


Abstract 


User I/O intensity can significantly impact the perfor- 
mance of on-line RAID reconstruction due to contention 
for the shared disk bandwidth. Based on this observa- 
tion, this paper proposes a novel scheme, called WorkOut 
(I/O Workload Outsourcing), to significantly boost RAID 
reconstruction performance. WorkOut effectively out- 
sources all write requests and popular read requests orig- 
inally targeted at the degraded RAID set to a surrogate 
RAID set during reconstruction. Our lightweight pro- 
totype implementation of WorkOut and extensive trace- 
driven and benchmark-driven experiments demonstrate 
that, compared with existing reconstruction approaches, 
WorkOut significantly reduces both the total recon- 
struction time and the average user response time. Im- 
portantly, WorkOut is orthogonal to and can be easily 
incorporated into any existing reconstruction algorithm. 
Furthermore, it can be extended to improve the perfor- 
mance of other background support RAID tasks, such as 
re-synchronization and disk scrubbing. 


1 Introduction 


As a fundamental technology for reliability and availabil- 
ity, RAID [30] has been widely deployed in modern stor- 
age systems. A RAID-structured storage system ensures 
that data will not be lost when disks fail. One of the key 
responsibilities of RAID is to recover the data that was 
on a failed disk, a process known as RAID reconstruc- 
tion. 

The performance of RAID reconstruction techniques 
is judged by two factors. The first is the time it takes to 
complete the reconstruction of a failed disk, since longer 
reconstruction times translate to a longer “window of 
vulnerability”, in which a second disk failure may cause 
persistent data loss. The second is the impact of the 
reconstruction process on the foreground workload, i.e., 
the degree to which user requests are affected by the 
ongoing reconstruction. 

Current approaches for RAID reconstruction fall into 
two different categories [11, 12]: off-line reconstruction, 
when the RAID devotes all of its resources to perform- 
ing reconstruction without serving any I/O requests from 
user applications, and on-line reconstruction, when the 
RAID continues to service user I/O requests during re- 
construction. 

Off-line reconstruction has the advantage that it’s 
faster than on-line reconstruction, but it is not practical 
in environments with high availability requirements, as 
the entire RAID set needs to be taken off-line during re- 
construction. 

On the other hand, on-line reconstruction allows fore- 
ground traffic to continue during reconstruction, but 
takes longer to complete than off-line reconstruction as 
the reconstruction process competes with the foreground 
workload for I/O bandwidth. In our experiments we find 
that on-line reconstruction (with heavy user I/O work- 
loads) can be as much as 70 times (70x) slower than 
off-line reconstruction (without user I/O workloads) (see 
Section 2.2). Moreover, while on-line reconstruction al- 
lows foreground workload to be served, the performance 
of the foreground workload might be significantly re- 
duced. In our experiments, we see cases where the user 
response time increases by a factor of 3 (3x) during on- 
line reconstruction (see Section 2.2). 

Improving the performance of on-line RAID recon- 
struction is becoming a growing concern in the light of 
recent technology trends: reconstruction times are ex- 
pected to increase in the future, as the capacity of drives 
grows at a much higher rate than other performance pa- 
rameters, such as bandwidth, seek time and rotational la- 
tency [10]. Moreover, with the ever growing number of 
drives in data centers, reconstruction might soon become 
the common mode of operation in large-scale systems 
rather than the exception [4, 9, 18, 32, 34]. 

A number of approaches have been proposed to im- 
prove the performance of RAID reconstruction, includ- 
ing for example optimizing the reconstruction work- 
flow [12, 22], the reconstruction sequence [3, 41] or the 
data layout [14, 49]. We note that all these approaches 
focus on a single RAID set. In this paper we propose a 
new approach to improving reconstruction performance, 
exploiting the fact that most data centers contain a large 
number of RAID sets. 

Inspired by recent work on data migration [2, 24, 46] 
and write off-loading [27], we propose WorkOut, a frame- 
work to significantly improve on-line reconstruction per- 
formance by I/O Workload Outsourcing. The main idea 
behind WorkOut is to temporarily redirect all write re- 
quests and popular read requests originally targeted at 
the degraded RAID set to a surrogate RAID set. The sur- 
rogate RAID set can be free space on another live RAID 
set or a set of spare disks. 

The benefits of WorkOut are two-fold. WorkOut re- 
duces the impact of reconstruction on foreground traffic 
because most user requests can be served from the surro- 
gate RAID set and hence no longer compete with the re- 
construction process for disk bandwidth. WorkOut also 
speeds up the reconstruction process, since more band- 
width on the degraded RAID set can be devoted to the 
reconstruction process. 

In more detail, WorkOut has the following salient fea- 
tures: 


(1) WorkOut tackles one of the most important factors 
adversely affecting reconstruction performance, 
namely, I/O intensity, that, to the best of our knowl- 
edge, has not been adequately addressed by the pre- 
vious studies [3, 41]. 

(2) WorkOut has a distinctive advantage of improv- 
ing both reconstruction time and user response 
time. It is a very effective reconstruction optimiza- 
tion scheme focusing on optimizing write-intensive 
workloads, a roadblock for many of the existing re- 
construction approaches [41]. 

(3) WorkOut is orthogonal and complementary to and 
can be easily incorporated into most existing RAID 
reconstruction approaches to further improve their 
performance. 

(4) In addition to boosting RAID reconstruction perfor- 
mance, WorkOut is very lightweight and can be eas- 
ily extended to improve the performance of other 
background tasks, such as re-synchronization [7] 
and disk scrubbing [36], that are also becoming 
more frequent and lengthier for the same reasons 
that reconstruction is becoming more frequent and 
lengthier. 


Extensive trace-driven and benchmark-driven exper- 
iments conducted on our lightweight prototype im- 
plementation of WorkOut show that WorkOut sig- 
nificantly outperforms the existing reconstruction ap- 
proaches PR [22] and PRO [42] in both reconstruction 
time and user response time. 

The rest of this paper is organized as follows. Back- 
ground and motivation are presented in Section 2. We de- 
scribe the design of WorkOut in Section 3. Performance 
evaluations of WorkOut based on a prototype implemen- 
tation are presented in Section 4. We analyze the relia- 
bility of WorkOut in Section 5 and present related work 
in Section 6. We point out directions for future research 
in Section 7 and summarize the main contributions of the 
paper in Section 8. 


2 Background and Motivation 


In this section, we provide some background and key ob- 
servations that motivate our work and facilitate our pre- 
sentation of WorkOut in later sections. 


2.1 Disk failures in the real world 


Recent studies of field data on partial or complete disk 
failures in large-scale storage systems indicate that disk 
failures happen at a significant rate [4, 9, 18, 32, 34]. 
Schroeder & Gibson [34] found that annual disk re- 
placement rates in the real world exceed 1%, with 2%- 
4% on average and up to 13% in some systems, much 
higher than 0.88%, the annual failure rates (AFR) spec- 
ified by the manufacturer’s datasheet. Bairavasundaram 
et al. [4] observed that the probability of latent sector 
errors, which can lead to disk replacement, is 3.45% in 
their study. Those failure rates, combined with the con- 
tinuously increasing number of drives in large-scale stor- 
age systems, raise concerns that in future storage sys- 
tems, recovery mode might become the common mode 
of operation [9]. 

Another concern arises from recent studies showing a 
significant amount of correlation in drive failures, indi- 
cating that, after one disk fails, another disk failure will 
likely occur soon [4, 9, 18]. Gibson [9] also points out 
that the probability of a second disk failure in a RAID 
system during reconstruction increases along with the 
reconstruction time: approximately 0.5% for one hour, 
1.0% for 3 hours and 1.4% for 6 hours. 

All the above trends make fast recovery from disk fail- 
ures an increasingly important factor in building storage 
systems. 


2.2 Mutually adverse impact of reconstruc- 
tion and user I/O requests 


During on-line RAID reconstruction, reconstruction re- 
quests and user I/O requests compete for the bandwidth 
of the surviving disks and adversely affect each other. 
User I/O requests increase the reconstruction time while 
the reconstruction process increases the user response 
time. 

Figure 1 shows the reconstruction times and user re- 
sponse times of a 5-disk RAID5 set with a stripe unit size 
of 64KB in three cases: (1) off-line reconstruction, (2) 
on-line reconstruction at the highest speed (when RAID 
favors the reconstruction process), and (3) on-line recon- 
struction at the lowest speed (when RAID favors user 
I/O requests). In this experiment we limit the capac- 
ity of each disk to 10GB. User I/O requests are gener- 
ated by Iometer [17] with 20% sequential and 60%/40% 
read/write requests of 8KB each. As shown in Figure 1, 
the user response time increases significantly along with 
the reconstruction speed, 3 times (3x) more than that in 
the normal mode. The on-line reconstruction process at 
the lowest speed takes 70 times (70x) longer than its off- 
line counterpart. 

Figure 1: Reconstruction and its performance impact. 

How reconstruction is performed impacts both the re- 
liability and availability of storage systems [11]. Stor- 
age system reliability is formally defined as MTTDL (the 
mean time to data loss) and increases with decreasing 
MTTR (the mean time to repair). Ironically, decreasing 
the MTTR (i.e., speeding up reconstruction) by throttling 
foreground user requests can lead to SLA (Service Level 
Agreement) violations, which in many environments are 
also perceived as reduced availability. Ideally, one would 
like to reduce both the reconstruction time and user re- 
sponse time in order to improve the reliability and avail- 
ability of RAID-structured storage systems. 
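The dependence of MTTDL on MTTR mentioned above can be made concrete with the classic first-order approximation for a single-parity array. This formula is not given in this paper; it is the standard textbook estimate and assumes N disks with independent, exponentially distributed failures and a per-disk mean time to failure MTTF much larger than MTTR:

```latex
\mathrm{MTTDL}_{\text{RAID5}} \;\approx\; \frac{\mathrm{MTTF}_{\text{disk}}^{2}}{N\,(N-1)\,\mathrm{MTTR}}
```

Under these assumptions, halving the reconstruction time (MTTR) roughly doubles MTTDL. Correlated failures, such as those discussed in Section 2.1, violate the independence assumption and make the real MTTDL lower than this estimate.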

In Figure 2, we take a closer look at how user I/O 
intensity affects the performance of RAID reconstruc- 
tion. The experimental setup is the same as that in Fig- 
ure 1, except that we impose different I/O request inten- 
sities. Moreover, the RAID reconstruction process is set 
to yield to user I/O requests (i.e., RAID favors user I/O 
requests). From Figure 2, we see that both the recon- 
struction time and user response time increase with IOPS 
(I/Os Per Second). When the user IOPS increases from 9 
to its maximum of 200, the reconstruction time increases 
by a factor of 20.9 and the average user response time 
increases by a factor of 3.76. 

From the above experiments and analysis, we believe 
that reducing the amount of user I/O requests directed 
to the degraded RAID set is an effective approach to si- 
multaneously reducing the reconstruction time and alle- 
viating the user performance degradation, thus improv- 
ing both reliability and availability. However, naively 
redirecting all requests from the degraded RAID to a 
surrogate RAID might overload the surrogate RAID and 
run the risk of wasting a lot of work by redirecting 
data that will never be accessed again. Our idea 
is therefore to exploit locality in the request stream and 
redirect only requests for popular data. 


2.3 Workload locality 


Previous studies indicate that access locality is one of the 
key web workload characteristics [1, 5, 6] and observe 
that 10% of files accessed on a web server approximately 
account for 90% of the requests and 90% of the bytes 
transferred [1]. Such studies also find that 20%-40% of 
the files are accessed only once for web workloads [6]. 

To exploit access locality, caches have been widely 
employed to improve storage system performance. Stor- 
age caches, while proven very effective in capturing 
workload locality, are so small in capacity compared with 
the typical storage device that they usually cannot cap- 
ture all workload locality. Thus the locality underneath 
the storage cache can still be effectively mined and uti- 
lized [23, 41]. For example, based on the study on C- 
Miner [23] that mines block correlation below the stor- 
age cache, correlation-directed prefetching and data lay- 
out help reduce the user response time of the baseline 
case by 12-25%. By utilizing the workload locality at 
the block level, PRO [41] reduces the reconstruction time 
by up to 44.7% and the user response time by 3.6-23.9% 
simultaneously. 

Based on these observations, WorkOut only redirects 
the popular read data to the surrogate RAID set to exploit 
access locality of read requests bound to the most pop- 
ular data. For simplicity, popular data in our design is 
defined as the data that has been read at least twice dur- 
ing reconstruction. Different from read requests, write 
requests can be served by any persistent storage device. 
Thus WorkOut redirects all write requests to the surro- 
gate RAID set. 


3 WorkOut 


In this section, we first outline the main principles guid- 
ing the design of WorkOut. Then we present an archi- 
tectural overview of WorkOut, followed by a description 
of the WorkOut organization and algorithm. The design 
choice and data consistency issues of WorkOut are dis- 
cussed at the end of this section. 


3.1 Design principles 


WorkOut focuses on outsourcing I/O workloads and aims 
to achieve reliability, availability, extendibility and flexi- 
bility, as follows. 

Reliability. To reduce the window of vulnerability and 
thus improve the system reliability, the reconstruction 
time must be significantly reduced. Since user I/O inten- 
sity severely affects the reconstruction process, WorkOut 
aims to reduce the I/O intensity on the degraded RAID 
set by redirecting I/O requests away from the degraded 
RAID set. 

Availability. To avoid a drop in user perceived perfor- 
mance and violation of SLAs, the user response time dur- 
ing reconstruction must be significantly reduced. Work- 
Out strives to achieve this goal by significantly reducing, 
if not eliminating, the contention between external user 
I/O requests and internal reconstruction requests, by out- 
sourcing I/O workloads to a surrogate RAID set. 

Extendibility. Since I/O intensity affects the per- 
formance of not only the reconstruction process but 
also other background support RAID tasks, such as re- 
synchronization and disk scrubbing, the idea of WorkOut 
should be readily extendable to improve the performance 
of these RAID tasks. 

Flexibility. Due to the high cost and inconvenience 
involved in modifying the organization of an existing 
RAID, it is desirable to completely avoid such modifica- 
tion and instead utilize a separate surrogate RAID set ju- 
diciously and flexibly. In the WorkOut design, the surro- 
gate RAID set can be a dedicated RAID1 set, a dedicated 
RAID5 set or a live RAID set that uses the free space of 
another operational (live) RAID set. Using a RAID as 
the surrogate set ensures that the redirected write data 
is safe-guarded with redundancy, thus guaranteeing the 
consistency of the redirected data. The choice of an 
appropriate surrogate RAID set depends on the require- 
ments for overhead, performance, reliability, and main- 
tainability and the trade-offs between them. 


3.2 WorkOut architecture overview 


Figure 3 shows an overview of WorkOut’s architecture. 
In our design, WorkOut is an augmented module to the 
RAID controller software operating underneath the stor- 
age cache in a system with multiple RAID sets. Work- 
Out interacts with the reconstruction module, but is im- 
plemented independently of it. WorkOut can be incor- 
porated into any RAID controller software, including 
various reconstruction approaches, and also other back- 
ground support RAID tasks. In this paper, we focus 
on the reconstruction process and a discussion on how 
WorkOut works with some other background support 
RAID tasks can be found in Section 4.7. 

WorkOut consists of five key functional components: 
Administration Interface, Popular Data Identifier, Surro- 
gate Space Manager, Request Redirector and Reclaimer, 
as shown in Figure 3. Administration Interface pro- 
vides an interface for system administrators to config- 
ure the WorkOut design options. Popular Data Identifier 
is responsible for identifying the popular read data. Re- 
quest Redirector is responsible for redirecting all write 
requests and popular read requests to the surrogate RAID 
set, while Reclaimer is responsible for reclaiming all 
redirected write data back after the reconstruction pro- 
cess completes. Surrogate Space Manager is responsi- 
ble for allocating and managing a space on the surro- 
gate RAID set for each current reconstruction process 
and controlling the data layout of the redirected data in 
the allocated space. 

Figure 3: An architecture overview of WorkOut. 

WorkOut is automatically activated by the reconstruc- 
tion module when the reconstruction thread is initiated 
and de-activated when the reclaim process completes. In 
other words, WorkOut is active throughout the entire re- 
construction and reclaim periods. Moreover, the reclaim 
thread is triggered by the reconstruction module when 
the reconstruction process completes. 

In WorkOut, the idea of a dedicated surrogate RAID 
set is targeted at a typical data center where the surro- 
gate set can be shared by multiple degraded RAID sets 
on either a space-division or a time-division basis. The 
mapping between degraded RAID sets and surrogate 
RAID sets is not one-to-one. The device overhead is incurred 
only during reconstruction of a degraded RAID set and 
typically amounts to a small fraction of the surrogate set 
capacity. For example, extensive experiments on our pro- 
totype implementation of WorkOut show that, across all the 
traces and benchmarks, no more than 4% of the capac- 
ity of a 4-disk surrogate RAID5 set is used during the 
WorkOut reconstruction. 

Figure 4: Data structures of WorkOut. (D_Table entries: D_Offset, 
S_Offset, Length, D_Flag; R_LRU entries: D_Offset, Length.) 

Based on the parameters pre-configured by system ad- 
ministrators, the Surrogate Space Manager allocates a 
disjoint space for each degraded RAID set that requests 
a surrogate RAID set, thus preventing the redirected data 
from being overwritten by redirected data from other de- 
graded RAID sets. Notably, the space allocated to a 
degraded RAID set is not fixed and can be expanded. For 
example, the Surrogate Space Manager first allocates an 
estimated space required for a typical degraded RAID set 
and, if the allocated space is used up to a preset thresh- 
old (e.g., 90%), it will allocate some extra space to this 
RAID set. In this paper, we mainly consider the sce- 
nario where there is at most one degraded RAID set at 
any given time. Implementing and evaluating WorkOut 
in a large-scale storage system with multiple concurrent 
degraded RAID sets is work in progress. 
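The threshold-based expansion described above can be sketched as follows. This is only an illustration under stated assumptions: the class name SurrogateSpace, the block-based accounting and the fixed growth increment grow_blocks are hypothetical, since the paper specifies only an initial estimate, a preset threshold (e.g., 90%) and the allocation of "some extra space".

```python
class SurrogateSpace:
    """Hypothetical per-degraded-set region on the surrogate RAID set."""

    def __init__(self, initial_blocks, grow_blocks, threshold_pct=90):
        if grow_blocks <= 0:
            raise ValueError("grow_blocks must be positive")
        self.allocated = initial_blocks      # blocks currently reserved
        self.grow_blocks = grow_blocks       # assumed fixed growth increment
        self.threshold_pct = threshold_pct   # preset usage threshold
        self.used = 0                        # blocks holding redirected data

    def reserve(self, n_blocks):
        """Account for n_blocks of redirected data; grow the region
        whenever usage reaches the threshold. Returns the region size."""
        self.used += n_blocks
        # Integer arithmetic avoids floating-point surprises at the boundary.
        while self.used * 100 >= self.threshold_pct * self.allocated:
            self.allocated += self.grow_blocks
        return self.allocated
```

For example, with an initial region of 100 blocks and an increment of 50, the region stays at 100 blocks until 90 blocks are in use and then grows to 150.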


3.3 The WorkOut organization and algorithm 


WorkOut relies on two key data structures to redirect re- 
quests and identify popular data, namely, D_Table and 
R_LRU, as shown in Figure 4. D_Table contains the log 
of all redirected data, including the following four im- 
portant variables. D_Offset and S_Offset indicate the off- 
sets of the redirected data in the degraded RAID set and 
the surrogate RAID set, respectively. Length indicates 
the length of the redirected data and D_Flag indicates 
whether it is the redirected write data from the user ap- 
plication (D_Flag is set to be true) or the redirected read 
data from the degraded RAID set (D_Flag is set to be 
false). R_LRU is an LRU list that stores the informa- 
tion (D_Offset and Length of read data) of the most recent 
read requests. Based on R_LRU, popular read data can 
be identified and redirected to the surrogate RAID set. 

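A minimal sketch of the two data structures, assuming Python as the illustration language (the prototype itself lives inside Linux MD). The field names follow the paper; the class names, the capacity parameter and the exact-match rule on (D_Offset, Length) are assumptions, since the paper does not say how overlapping ranges are compared.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class DTableEntry:
    d_offset: int   # offset of the data in the degraded RAID set
    s_offset: int   # offset of the data in the surrogate RAID set
    length: int     # length of the redirected data
    d_flag: bool    # True: redirected write data; False: redirected read data


class RLru:
    """LRU list of recent read requests, keyed by (d_offset, length)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def touch(self, d_offset, length):
        """Record a read. Returns True if the same range was already in
        the list, i.e. this is at least the second read (popular data)."""
        key = (d_offset, length)
        hit = key in self._items
        if hit:
            self._items.move_to_end(key)         # refresh recency
        else:
            self._items[key] = None
            if len(self._items) > self.capacity:
                self._items.popitem(last=False)  # evict least recently used
        return hit
```

A hit in RLru.touch corresponds to the paper's definition of popular data as data read at least twice during reconstruction, within the limits of the list's capacity.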
WorkOut focuses on outsourcing user I/O requests 
during reconstruction and does not modify the recon- 
struction algorithm. How to perform the reconstruction 
process remains the responsibility of the reconstruction 
module and depends on the specific reconstruction algo- 
rithm, and is thus not described further in this paper. 

During reconstruction, all write requests are redirected 
to the surrogate RAID set after determining whether they 
should overwrite their previous location or write to a new 
location according to D_Table. For each read 
request, D_Table is first checked to determine whether 
the read data is in the surrogate RAID set. If the read 
request does not hit D_Table, it will be served by the de- 
graded RAID set. If it hits R_LRU, the read data is con- 
sidered popular and redirected to the surrogate RAID set, 
and the corresponding data information is inserted into 
D_Table. If the entire targeted read data is already in the 
surrogate RAID set, the read request will be served by 
the surrogate RAID set. Otherwise, if only a portion of 
the read data is in the surrogate RAID set, i.e., it partially 
hits D_Table, the read request will be split and served by 
both sets. In order to achieve better performance, the 
redirected data is laid out sequentially like LFS [33] in 
the allocated space on the surrogate RAID set. 
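The request handling described above can be sketched at block granularity. This is a simplified model, not the MD implementation: the class WorkOutRouter, the per-block dictionary standing in for D_Table, the unbounded set standing in for R_LRU, and the choice to mark an overwritten read copy as write data are all assumptions made for illustration.

```python
class WorkOutRouter:
    """Toy model of request redirection during reconstruction."""

    def __init__(self):
        self.d_table = {}          # degraded block -> (surrogate block, d_flag)
        self.recent_reads = set()  # stand-in for R_LRU (unbounded here)
        self.next_free = 0         # tail of the sequential surrogate log

    def _append(self, block, d_flag):
        # New data goes to the log tail, giving an LFS-like sequential layout.
        s_block = self.next_free
        self.next_free += 1
        self.d_table[block] = (s_block, d_flag)
        return s_block

    def write(self, blocks):
        """Redirect every write; returns the surrogate blocks used."""
        placed = []
        for b in blocks:
            if b in self.d_table:
                s_block, _ = self.d_table[b]
                self.d_table[b] = (s_block, True)   # overwrite previous location
            else:
                s_block = self._append(b, True)     # write to a new location
            placed.append(s_block)
        return placed

    def read(self, blocks):
        """Split one read into (served by surrogate, served by degraded)."""
        from_surrogate = [b for b in blocks if b in self.d_table]
        from_degraded = [b for b in blocks if b not in self.d_table]
        for b in from_degraded:
            if b in self.recent_reads:
                self._append(b, False)   # popular: copy to the surrogate set
            else:
                self.recent_reads.add(b)
        return from_surrogate, from_degraded
```

In this model the first and second reads of a block are served by the degraded set, the second read also copies the block to the surrogate set, and later reads are served from there. A read that covers both redirected and non-redirected blocks is split between the two sets.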

The redirected write data is only temporarily stored 
in the surrogate RAID set and thus should be reclaimed 
back to the newly recovered RAID set (i.e., the for- 
merly degraded RAID set) after the reconstruction pro- 
cess completes. To ensure data consistency, the log of 
reclaimed data is deleted from D_Table after the write 
succeeds. Since the redirected read data is already in the 
degraded RAID set, it need not be reclaimed as long as 
logs of such data are deleted from D_Table to indicate 
that the data in the surrogate RAID set is invalid. In or- 
der not to affect the performance of the newly recovered 
RAID set, the priority of the reclaim process is set to 
be the lowest, which will not affect the reliability of the 
redirected data as explained in Section 5. 

During the reclaim period, all requests on the newly 
recovered RAID set must be checked carefully in 
D_Table to ensure data consistency. If a write request 
hits D_Table and its D_Flag is true, meaning that it will 
rewrite the old data that is still in the surrogate RAID set, 
the corresponding log in D_Table must be deleted after 
writing the data to the correct location on the newly re- 
covered RAID set, to prevent the new write data from 
being overwritten by the reclaimed data. In addition, if a 
read request hits D_Table and its D_Flag is true, meaning 
that the up-to-date data of the read request has not been 
reclaimed back, the read request will be served by the 
surrogate RAID set. 
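The consistency rules for the reclaim period can be sketched as three small functions. This is a toy model under assumptions: D_Table is a dictionary from block number to (surrogate block, D_Flag), the two RAID sets are plain dictionaries, and the function names are hypothetical.

```python
def reclaim_step(d_table, surrogate, recovered):
    """Reclaim one logged block. Returns the block handled, or None
    when D_Table is empty and the reclaim process is complete."""
    if not d_table:
        return None
    block, (s_block, d_flag) = next(iter(d_table.items()))
    if d_flag:
        # Redirected write data: copy it back first ...
        recovered[block] = surrogate[s_block]
    # ... and only then delete the log entry. Redirected read data
    # (d_flag False) is already on the recovered set, so its entry is
    # simply dropped to invalidate the surrogate copy.
    del d_table[block]
    return block


def write_during_reclaim(d_table, recovered, block, data):
    """A user write goes to the recovered set; a stale redirected-write
    entry is deleted so reclaim cannot overwrite the new data later."""
    recovered[block] = data
    entry = d_table.get(block)
    if entry is not None and entry[1]:
        del d_table[block]


def read_during_reclaim(d_table, surrogate, recovered, block):
    """Serve from the surrogate set only if it holds unreclaimed write data."""
    entry = d_table.get(block)
    if entry is not None and entry[1]:
        return surrogate[entry[0]]
    return recovered[block]
```

The ordering inside reclaim_step mirrors the paper's rule that a log entry is deleted only after the write-back succeeds, so a crash between the two steps leaves the entry in place and the copy can be repeated.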


3.4 Design choices 


WorkOut can redirect data to different persistent config- 
urations of storage devices, such as a dedicated surrogate 
RAID1 set, a dedicated surrogate RAID5 set and a live 
surrogate RAID set. 

A dedicated surrogate RAID1 set. In this case, 
WorkOut stores the redirected data on two mirrored 
disks, namely, a dedicated surrogate RAID1 set. The ad- 
vantages of this design option are its high reliability, simple 
space management and moderate device overhead (i.e., 2 
disks), while its disadvantage is a relatively low 
performance gain due to the lack of I/O parallelism. 

A dedicated surrogate RAID5 set. In favor of relia- 
bility and performance (access parallelism), a dedicated 
surrogate RAID5 set with several disks can be deployed 
to store the redirected data. The space management is 
simple while the device overhead (e.g., 4 disks) is rela- 
tively high. 

A live surrogate RAID set. WorkOut can utilize 
the free space of another live RAID set in a 
large-scale storage system consisting of multiple RAID 
sets, which avoids the additional device overhead 
incurred by the first two design options. In this 
case, WorkOut gains high reliability owing to its redun- 
dancy, but requires complicated maintenance. Due to the 
contention between the redirected requests from the de- 
graded RAID set and the native I/O requests targeted at 
the live surrogate RAID set, the performance in this case 
is lower than that in the former two design options. 

Optional surrogate RAID set       Device Overhead   Performance   Reliability   Maintainability 
A dedicated surrogate RAID1 set   medium            medium        high          simple 
A dedicated surrogate RAID5 set   high              high          high          simple 
A live surrogate RAID set         low               low           medium-high   complicated 

Table 1: Characteristic comparisons of three optional surrogate RAID sets used in WorkOut. 

The three design options are all feasible and can be 
made available for system administrators to choose from 
through the Administration Interface based on their char- 
acteristics and tradeoffs, as summarized in Table 1. In 
this paper, the prototype implementation and perfor- 
mance evaluations are centered around the dedicated sur- 
rogate RAID5 set, although sample results from the other 
two design choices are also given to show the quantita- 
tive differences among them. 


3.5 Data consistency 


Data consistency in WorkOut includes two aspects: (1) 
redirected data must be reliably stored in the surrogate 
RAID set, and (2) the key data structures must be safely 
stored until the reclaim process completes. 

First, in order to avoid data loss caused by a disk fail- 
ure in the surrogate RAID set, all redirected write data in 
the surrogate RAID set should be protected by a redun- 
dancy scheme, such as RAID1 or RAID5. To simplify 
the design and implementation, the redirected read data 
is stored in the same manner as the redirected write data. 
If a disk failure in the surrogate RAID set occurs, data 
will no longer be redirected to the surrogate RAID set 
and the write data that was already redirected should be 
reclaimed back to the degraded RAID set or redirected 
to another surrogate RAID set if possible. Our prototype 
implementation adopts the first option. We will analyze 
the reliability of WorkOut in Section 5. 

Second, since we must ensure never to lose the con- 
tents of D_Table during the entire period when Work- 
Out is activated, it is stored in NVRAM to prevent 
data loss in the event of a power supply failure. Fortu- 
nately, D_Table is in general very small (see Section 4.8) 
and thus will not incur significant hardware cost. More- 
over, since the performance of battery-backed RAM, a 
de facto standard form of NVRAM for storage con- 
trollers [8, 15, 16], is roughly the same as that of main mem- 
ory, the write penalty due to D_Table updates should be neg- 
ligible. 


4 Performance Evaluations 

In this section, we evaluate the performance of our pro- 
totype implementation of WorkOut through extensive 
trace-driven and benchmark-driven experiments. 


4.1 Prototype implementation 


We have implemented WorkOut by embedding it into the 
Linux Software RAID (MD) as a built-in module. In or- 
der not to impact the RAID performance in normal mode, 
WorkOut is activated only when the reconstruction pro- 
cess is initiated. During reconstruction, WorkOut tracks 
user I/O requests in the make_request function and issues 
them to the degraded RAID set or the surrogate RAID set 
based on the request type and D_Table. 

By setting the reconstruction bandwidth range, MD 
assigns different disk bandwidth to serve user I/O re- 
quests and reconstruction requests and ensures that the 
reconstruction speed is confined within the set range 
(i.e., between the minimum and maximum reconstruc- 
tion bandwidth). For example, if the reconstruction 
bandwidth range is set to be the default of 1MB/s- 
200MB/s, MD will favor user I/O requests while en- 
suring that the reconstruction speed is at least 1MB/s. 
Under heavy I/O workloads, MD will keep the recon- 
struction speed at approximately 1MB/s but allows it to 
be much higher than 1MB/s when I/O intensity is low. 
At one extreme when there is no user I/O, the recon- 
struction speed will be roughly equal to the disk trans- 
fer rate (e.g., 78MB/s in our prototype system). Equiva- 
lently, the minimum reconstruction bandwidth of X MB/s 
(e.g., 1MB/s, 10MB/s, 100MB/s) refers to a reconstruc- 
tion range of X MB/s-200MB/s in MD. When the mini- 
mum reconstruction bandwidth is set to 100MB/s, which 
is not achievable for most disks, MD utilizes any disk 
bandwidth available for the reconstruction process. 
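The paper does not say how the bandwidth range was configured. On Linux, MD exposes the range through the files speed_limit_min and speed_limit_max under /proc/sys/dev/raid, with values in KB/s per device and defaults of 1000 and 200000, which matches the 1MB/s-200MB/s default quoted above. The sketch below assumes that these files are the mechanism used; the helper name set_md_rebuild_range and its root parameter are hypothetical.

```python
from pathlib import Path


def set_md_rebuild_range(min_mb_s, max_mb_s=200, root="/proc/sys/dev/raid"):
    """Write the MD rebuild speed range (the files take KB/s per device).

    Writing under /proc requires root privileges; pass another `root`
    directory to try the helper without touching the running system.
    """
    if not 0 < min_mb_s <= max_mb_s:
        raise ValueError("need 0 < min_mb_s <= max_mb_s")
    values = {
        "speed_limit_min": min_mb_s * 1000,
        "speed_limit_max": max_mb_s * 1000,
    }
    for name, kb_s in values.items():
        (Path(root) / name).write_text(f"{kb_s}\n")
    return values
```

Under this assumption, the three minimum bandwidths used in the paper would correspond to set_md_rebuild_range(1), set_md_rebuild_range(10) and set_md_rebuild_range(100).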

To better examine the WorkOut performance on ex- 
isting RAID reconstruction algorithms, we incorporate 
WorkOut into MD’s default reconstruction algorithm PR, 
and PRO-powered PR (PRO for short) that is also imple- 
mented in MD. PR (Pipeline Reconstruction) [22] takes 
advantage of the sequential property of track retrievals 
to pipeline the reading and writing processes. PRO 
(Popularity-based multi-threaded Reconstruction Opti- 
mization) [41, 42] allows the reconstruction process to 
rebuild the frequently accessed areas prior to other areas. 


4.2 Experimental setup and methodology 


We conduct our performance evaluation of WorkOut on 
a platform of server-class hardware with an Intel Xeon 
3.0GHz processor and 1GB DDR memory. We use 
2 Highpoint RocketRAID 2220 SATA cards to house 
15 Seagate ST3250310AS SATA disks. The rotational 
speed of these disks is 7200 RPM, with a sustained transfer
rate of 78MB/s, and the individual disk capacity is
250GB. A separate IDE disk is used to house the operating
system (Fedora Core 4 Linux, kernel version 2.6.11)
and other software (MD and mdadm). To match the footprint
of the workloads, we limit the capacity of each disk to
10GB in the experiments. In our prototype implementation,
main memory is used as a substitute for battery-backed
RAM for simplicity.

Generally speaking, there are two models for trace re- 
play: open-loop and closed-loop [26, 35]. The former 
has the potential to overestimate the user response time 
measure since the I/O arrival rate is independent of the 
underlying system and thus can cause the request queue 
(and hence the queuing delays) to grow rapidly when sys- 
tem load is high. The opposite is true for closed systems 
as the I/O arrival rate is dictated by the processing speed 
of the underlying system and the request queue is gen- 
erally limited in length (i.e., equal to the number of in- 
dependent request threads). In this paper, we use both 
an open-loop model (trace replay with RAIDmeter [41]) 
and a closed-loop model (TPC-C-like benchmark [43]) 
to evaluate the performance of WorkOut. 
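The difference between the two replay models can be seen in a small deterministic simulation. This is a sketch under simplifying assumptions (a single FIFO server with a fixed service time); the function names and parameters are illustrative and do not come from RAIDmeter or TPCC-UVA.

```python
from collections import deque


def open_loop_response_times(n, interarrival, service):
    """Arrivals follow a fixed schedule, independent of system speed."""
    responses = []
    server_free_at = 0.0
    for i in range(n):
        arrival = i * interarrival
        start = max(arrival, server_free_at)
        server_free_at = start + service
        responses.append(server_free_at - arrival)
    return responses


def closed_loop_response_times(n, threads, service):
    """Each thread issues its next request only when the previous one completes."""
    issue_times = deque([0.0] * threads)   # every thread issues at t = 0
    responses = []
    server_free_at = 0.0
    for _ in range(n):
        issued = issue_times.popleft()
        start = max(issued, server_free_at)
        server_free_at = start + service
        responses.append(server_free_at - issued)
        issue_times.append(server_free_at)  # the thread re-issues on completion
    return responses
```

When the service time exceeds the inter-arrival time, the open-loop response time grows without bound as the queue builds up, whereas the closed-loop response time is capped at the number of threads times the service time.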

The traces used in our experiments are obtained from 
the Storage Performance Council [29, 39]. The two fi- 
nancial traces (Fin1 and Fin2) were collected from OLTP 
applications running at a large financial institution and 
the WebSearch2 trace (or Web) was collected from a machine
running a web search engine. The three traces represent
different access patterns in terms of write ratio,
IOPS and average request size, as shown in Table 2. The 
write ratio of the Fin1 trace is the highest, followed by 
the Fin2 trace. The read-dominated Web trace exhibits 
the strongest locality in its access pattern. Since the re- 
quest rate in the Web trace is too high to be sustained by 
our degraded RAID set, we only use one part of it that 
is attributed to device zero while the part due to devices 
one and two is ignored. 

Since the three traces have very limited footprints, that
is, the user I/O requests are concentrated on a small part
of the RAID set (e.g., less than 10% of an 8-disk RAID5
set for the Fin1 trace), their replays may not realistically
represent a typical reconstruction scenario where user requests
may be spread out over the entire disk address
space. To fully and evenly cover the address space of the
RAID set, we scale up the address coverage of the I/O
requests by multiplying the address of each request by
an appropriate constant scaling factor without changing
the size of each request. The main adverse impact of this
trace scaling is likely to be on requests that are originally
sequential but become non-sequential after scaling; however,
the percentage of such sequential requests in the three traces
of this study is relatively small, at less than 4% [45]. Thus,
we believe that the adverse impact of the trace scaling is
rather limited in this study and far outweighed by the benefit
of representing a more realistic reconstruction scenario. We
find in our experiments that the observed trends are similar
for the original and the scaled traces, suggesting that the
choice between them is unlikely to change the conclusions of
the study. Nevertheless, we choose to present the results of
the scaled traces for the aforementioned reasons.
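The scaling step, and its side effect on sequential requests, can be sketched as follows. The trace representation (a list of `(address, size)` tuples in blocks) and both function names are assumptions made for illustration; they are not the SPC trace format or RAIDmeter's interface.

```python
def scale_trace(requests, factor):
    """Multiply each request address by a constant; request sizes are unchanged."""
    return [(addr * factor, size) for addr, size in requests]


def sequential_pairs(requests):
    """Count consecutive requests where the next starts where the previous ended."""
    return sum(
        1
        for (addr, size), (next_addr, _) in zip(requests, requests[1:])
        if addr + size == next_addr
    )
```

For example, the trace `[(0, 8), (8, 8), (100, 4)]` contains one sequential pair; after scaling by 10 it becomes `[(0, 8), (80, 8), (1000, 4)]`, which contains none. This is the loss of sequentiality that the paragraph above argues is limited in practice.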

The trace replay tool is RAIDmeter [41, 42], which replays
traces at the block level and evaluates the user response
time of the storage device. The RAID reconstruction
performance is evaluated in terms of the following
two metrics: reconstruction time and average user re- 
sponse time during reconstruction. 

The TPC-C-like benchmark is implemented with 
TPCC-UVA [31] and the Postgres database. It gen- 
erates mixed transactions based on the TPC-C specifi- 
cation [43]. 20 warehouses are built on the Postgres 
database with the ext3 file system on the degraded RAID 
set. Transactions, such as PAYMENT, NEW_ORDER 
and DELIVERY, generate read and write requests. To 
evaluate the WorkOut performance, we compare the 
transaction rates (transactions per minute) that are gen- 
erated at the end of the benchmark execution. 


4.3 Trace-driven evaluations


We first conduct experiments on an 8-disk RAID5 set 
with a stripe unit size of 64KB while running PR, PRO 
and WorkOut-powered PR and PRO respectively. Ta- 
ble 3 and Table 4, respectively, show the reconstruction 
time and average user response time under the minimum 
reconstruction bandwidth of 1MB/s, driven by the three 
traces. We configure a 4-disk dedicated RAID5 set with
a stripe unit size of 64KB as the surrogate RAID set to 
boost the reconstruction performance of the 8-disk de- 
graded RAID set. 

From Table 3, one can see that WorkOut speeds up the 
reconstruction time by a factor of up to 5.52, 1.64 and 
1.30 for the Fin1, Fin2 and Web traces, respectively. The 
significant improvement achieved on the Fin1 trace, with 
a reconstruction time of 203.1s vs. 1121.8s for PR and 
188.3s vs. 1109.6s for PRO, is due to the fact that 84% 
of requests (69% of writes plus 15% of reads) are redi- 
rected away from the degraded RAID set (see Figure 5), 
which enables the speed of the on-line reconstruction to 
approach that of the off-line counterpart. In our experi- 
ments, the off-line reconstruction time is 136.4 seconds 
for PR on the same platform. Moreover, WorkOut outsources
36% and 34% of user I/O requests away from the degraded
RAID set for the Fin2 and Web traces, respectively, far fewer
than for the Fin1 trace, thus reducing the reconstruction
time by a correspondingly smaller amount.

Table 3: The reconstruction time results.

Figure 5: Percentage of redirected requests for WorkOut,
under the minimum reconstruction bandwidth of 1MB/s.
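As a quick check of the Fin1 figures quoted in the text for PR, the arithmetic can be written out directly. The variable names are illustrative; the three times are the values given above.

```python
pr_online = 1121.8    # seconds, PR during on-line reconstruction (Fin1)
workout_pr = 203.1    # seconds, WorkOut-powered PR (Fin1)
pr_offline = 136.4    # seconds, off-line reconstruction with PR

speedup = pr_online / workout_pr
gap_to_offline = workout_pr / pr_offline

print(f"speedup over PR: {speedup:.2f}x")              # 5.52x
print(f"WorkOut vs. off-line: {gap_to_offline:.2f}x")  # 1.49x
```

The second ratio shows how close the on-line reconstruction with WorkOut comes to the off-line case: it takes roughly one and a half times as long.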

Table 4 shows that, compared with PR, WorkOut 
speeds up the average user response time by a factor of 
up to 2.87, 2.66 and 1.36 for the Fin1, Fin2 and Web
traces, respectively. For Fin1 and Fin2, the average user
response times during reconstruction under WorkOut are
even better than those in the normal or degraded period.
The reasons why WorkOut achieves significant improve- 
ment on user response times are threefold. First, a sig- 
nificant amount of requests are redirected away from the 
degraded RAID set, as shown in Figure 5. The response 
times of redirected requests are no longer affected by 
the reconstruction process that competes for the available 
bandwidth with user I/O requests on the degraded RAID 
set. Second, redirected data is laid out sequentially in the 
surrogate RAID set, thus further speeding up the user re- 
sponse time. Third, since many requests are outsourced, 
the I/O queue on the degraded RAID set is shortened 
accordingly, thus reducing the response times of the re- 
maining I/O requests served by the degraded RAID set. 
Therefore, the average user response time with WorkOut 
is significantly lower than that without WorkOut, especially
for the Fin1 trace.

Table 3 and Table 4 show that WorkOut-powered PRO
performs similarly to WorkOut-powered PR. The reason
is that WorkOut redirects all write requests and popular
read requests to the surrogate RAID set, thus reducing
the degree of popularity of I/O workloads retained on
the degraded RAID set that can be exploited by PRO.
Based on this observation, in the following experiments,
we only compare WorkOut-powered PR (WorkOut for short)
with PR and PRO.

Figure 6: Comparisons of reconstruction times and average
user response times with respect to different minimum
reconstruction bandwidths (1MB/s, 10MB/s and
100MB/s) driven by the Fin2 trace.


4.4 Sensitivity study 


WorkOut’s performance is likely influenced by several 
important factors, including the available reconstruction 
bandwidth, the size of the degraded RAID set, the stripe 
unit size, and the RAID level. Due to lack of space, 
we limit our study of these parameters to the Fin2 trace.
The other traces show trends similar to those of the Fin2 trace.

Reconstruction bandwidth. To evaluate how the 
minimum reconstruction bandwidth affects reconstruc- 
tion performance, we conduct experiments that measure 
reconstruction times and average user response times as 
a function of the minimum reconstruction bandwidth
(1MB/s, 10MB/s and 100MB/s). Figure 6 plots the
experimental results on an 8-disk RAID5
set with a stripe unit size of 64KB.

Figure 6(a) shows that WorkOut speeds up the reconstruction
time more significantly with a lower minimum
reconstruction bandwidth than with a higher one. The
reason is that the reconstruction process already exploits
all available disk bandwidth when the reconstruction
bandwidth is higher, thus leaving very little room for
the reconstruction time to be further improved.

Figure 7: Comparisons of reconstruction times and average
user response times with respect to the number of
disks (5, 8, 11) driven by the Fin2 trace.


From Figure 6(b), in contrast, the user response time 
increases rapidly with the increasing minimum recon- 
struction bandwidth for both PR and PRO, but much 
more slowly for WorkOut. WorkOut speeds up the user 
response time significantly, by a factor of up to 10.2 and 
7.38 over PR and PRO, respectively, when the minimum 
reconstruction bandwidth is set to 100MB/s. From this
viewpoint, the user response time with WorkOut is much 
less sensitive to the minimum reconstruction bandwidth 
than that without WorkOut. In other words, if the reconstruction
bandwidth is set very high or the storage system
is reliability-oriented, that is, the reconstruction process
is given more bandwidth to favor system reliability,
the user response time improvement by WorkOut will be
much more significant. Moreover, the user response time
during reconstruction for PR and PRO is so long that it
will likely violate SLAs and thus become unacceptable to
end users.


Number of disks. To examine the sensitivity of WorkOut
to the number of disks in the degraded RAID set, we
conduct experiments on RAID5 sets consisting of different
numbers of disks (5, 8 and 11) with a stripe unit size
of 64KB under the minimum reconstruction bandwidth 
of 1MB/s. Figure 7 shows the experimental results for 
PR, PRO and WorkOut. 


Figure 7(a) and Figure 7(b) show that, for all three approaches,
the reconstruction time increases and the user
response time decreases as the number of disks in
the degraded RAID set grows. The reason is that more disks
in a RAID set imply not only a larger RAID group size,
and thus more disk read operations to reconstruct a failed
drive, but also higher parallelism for the I/O process.
However, WorkOut is less sensitive to the number of
disks than PR and PRO.

Stripe unit size. To examine the impact of the stripe
unit size, we conduct experiments on an 8-disk RAID5
set with stripe unit sizes of 16KB and 64KB, respectively.
The experimental results show that WorkOut outperforms
PR and PRO in reconstruction time as well
as average user response time for both stripe unit sizes.
Moreover, the reconstruction times and average user response
times of WorkOut are almost unchanged, suggesting
that WorkOut is not sensitive to the stripe unit size.

Figure 8: Comparisons of reconstruction times and average
user response times with respect to different RAID
levels (10, 6) driven by the Fin2 trace.

RAID level. To evaluate WorkOut with differ- 
ent RAID levels, we conduct experiments on a 4-disk 
RAID10 set and an 8-disk RAID6 set with the same 
stripe unit size of 64KB under the minimum reconstruc- 
tion bandwidth of 1MB/s. In the RAID6 experiments, we 
measure the reconstruction performance when two disks 
fail concurrently. 

From Figure 8, one can see that WorkOut speeds 
up both the reconstruction times and average user re- 
sponse times for the two sets. The difference in the 
amount of performance improvement seen for RAID10 
and RAID6 is caused by the different user I/O intensi- 
ties, since the RAID10 and RAID6 sets have different 
numbers of disks. The user I/O intensity on individual 
disks in the RAID10 set is higher than that in the RAID6
set, thus leading to longer reconstruction times. 

On the other hand, since each read request to the failed 
disks in a RAID6 set must wait for its data to be rebuilt 
on-the-fly, the user response time is severely affected for 
PR, while this performance degradation is significantly 
lower under WorkOut due to its external I/O outsourc- 
ing. For the RAID10 set, however, the situation is quite 
different. Since the read data can be directly returned 
from the surviving disks, WorkOut provides smaller improvements
in user response time for RAID10 than for
RAID6. 


4.5 Different design choices for the surrogate 
RAID set 


All experiments reported up to this point in this paper
adopt a dedicated surrogate RAID5 set. To examine
the impact of different types of surrogate RAID set on
WorkOut’s performance, we also conduct experiments
with a dedicated surrogate RAID1 set (two mirrored
disks) and a live surrogate RAID set (replaying the Fin1
trace on a 4-disk RAID5 set). Similar to the experiments
conducted in the PARAID [46] and write off-loading [27]
studies, we reserve 10% of the storage space at
the end of the live RAID5 set to store data redirected by
WorkOut. The degraded RAID set is an 8-disk RAID5
set with a stripe unit size of 64KB, operating under a minimum
reconstruction bandwidth of 1MB/s.

Figure 9: A comparison of average user response times
for different types of surrogate RAID set.

The experimental results show that the reconstruction
times achieved by WorkOut are almost the same for the
three types of surrogate RAID set and, as expected, are
shorter than those of PR shown in Table 3, since WorkOut
outsources the same amount of requests during reconstruction
in all three cases. The results for user response times are
somewhat different, as shown in Figure 9. The dedicated
surrogate RAID5 set results in the best user response times
for the three traces.

From Figure 9, one can see that the dedicated surro- 
gate RAID sets (both RAID1 and RAID5) outperform 
the live surrogate RAID set in user response time. The 
reason is the contention between the native I/O requests 
and the redirected requests in the live surrogate RAID 
set. Serving the native I/O requests not only increases
the load on the surrogate RAID set, compared with the
dedicated surrogate RAID set, but also destroys some of
the sequentiality in LFS-style writes. The redirected requests
also increase the overall I/O intensity on the live
surrogate RAID set and affect its performance. Our ex- 
perimental results show that the performance impact on 
the live surrogate RAID set is 43.9%, 23.6% and 36.8% 
on average when the degraded RAID set replays the Fin1, 
Fin2 and Web traces, respectively. The experimental re- 
sults are consistent with the comparisons in Table 1. 


4.6 Benchmark-driven evaluations 


In addition to trace-driven experiments, we also conduct 
experiments on an 8-disk RAID5 set with a stripe unit
size of 64KB under the minimum reconstruction band- 
width of 1MB/s, driven by a TPC-C-like benchmark. 

From Figure 10(a), one can see that PRO performs al- 
most the same as PR due to the random access charac- 
teristics of the TPC-C-like benchmark. Since WorkOut 
outsources all write requests that are generated by the 
transactions, both the degraded RAID set and surrogate 
RAID set serve the benchmark application, thus increas- 
ing the transaction rate. WorkOut outperforms PR and 
PRO in terms of transaction rate, with an improvement 
of 46.6% and 36.9% respectively. It also outperforms 
the original system in the normal mode (the normalized 
baseline) and the degraded mode, with an improvement 
of 4.0% and 22.6% respectively.

Figure 10: Comparisons of reconstruction times and
transaction rates driven by the TPC-C-like benchmark.

Figure 11: Comparisons of re-synchronization times and
average user response times during re-synchronization.


On the other hand, since the TPC-C-like benchmark is 
highly I/O intensive, all disks in the RAID set are driven 
to saturation, thus the reconstruction speed is kept at 
around its minimum allowable bandwidth of 1MB/s for 
PR and PRO. As shown in Figure 10(b), the reconstruc- 
tion times for PR and PRO are similar, at 9835 seconds 
and 9815 seconds, respectively, while that for WorkOut 
is 8526 seconds, with approximately 15% improvement 
over PR and PRO. WorkOut gains much less in recon- 
struction time with the benchmark-driven experiments 
than with the trace-driven experiments. The main rea- 
son lies in the fact that the very high I/O intensity of the 
benchmark application constantly pushes the RAID set 
to operate at or close to its saturation point, leaving very 
little disk bandwidth for the reconstruction process even 
with some of the transaction requests being outsourced 
to the surrogate RAID set. 
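The reported times can be cross-checked against the 1MB/s floor with a few lines of arithmetic. The sketch below assumes that one 10GB disk is rebuilt and that 1GB equals 1024MB; both are assumptions for illustration, and the variable names are not from the paper.

```python
DISK_MB = 10 * 1024   # assumption: 10GB rebuilt per failed disk, 1GB = 1024MB
times = {"PR": 9835, "PRO": 9815, "WorkOut": 8526}   # seconds, from the text

# Effective reconstruction speed in MB/s for each approach.
rates = {name: DISK_MB / seconds for name, seconds in times.items()}

# Speedup of WorkOut over PR, expressed as a ratio of reconstruction times.
speedup_over_pr = times["PR"] / times["WorkOut"]
```

Under these assumptions PR and PRO rebuild at about 1.04MB/s, consistent with being pinned near the minimum bandwidth, while WorkOut reaches about 1.2MB/s. The ratio of reconstruction times is about 1.15, which matches the approximately 15% improvement stated above.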


4.7 Re-synchronization with WorkOut 


To demonstrate how WorkOut optimizes other background
support RAID tasks, such as RAID re-synchronization,
we conduct experiments on an 8-disk
RAID5 set with a stripe unit size of 64KB under the minimum
re-synchronization bandwidth of 1MB/s, driven
by the three traces. We configure a dedicated 4-disk
RAID5 set with a stripe unit size of 64KB as the
surrogate RAID set. The experimental results of re-synchronization
times and average user response times
during re-synchronization are shown in Figure 11(a) and
Figure 11(b), respectively.

Although the re-synchronization process performs
somewhat differently from the reconstruction process,
re-synchronization requests also compete for disk resources
with user I/O requests. By redirecting a significant
amount of user I/O requests away from the RAID set
during re-synchronization, WorkOut can reduce both the
re-synchronization times and user response times. The
results are very similar to those in the reconstruction
experiments, as are the reasons behind them.


4.8 Overhead analysis 


Device overhead. WorkOut is designed for use in a 
large-scale storage system consisting of many RAID sets 
that share one surrogate RAID composed of spare disks. 
In such an environment, the device overhead introduced 
by WorkOut is small given that a single surrogate RAID 
can be shared by many production RAID sets. Neverthe- 
less, for a small-scale storage system composed of only 
one or two RAID sets with few hot spare disks, the device 
overhead of a dedicated surrogate RAID set in WorkOut 
cannot be ignored. In this case, to be cost-effective, we 
recommend the use of a dedicated surrogate RAID1 set 
instead of a dedicated surrogate RAID5 set, since the device
overhead of the former (i.e., 2 disks) is lower than
that of the latter (e.g., 4 disks in our experiments). 

To quantify the cost-effectiveness of WorkOut in
this resource-restricted environment, we conduct experiments
and compare the performance of WorkOut (8-disk
data RAID5 set plus 4-disk surrogate RAID5 set)
with that of PR (12-disk data RAID5 set), i.e., we use
the same number of disks in both systems. Experiments
are run under the minimum reconstruction bandwidth of
1MB/s and driven by the Fin2 trace. The results show
that WorkOut speeds up the reconstruction time of PR
significantly, by a factor of 1.66. The average user response
time during reconstruction achieved by WorkOut
is 16.5% shorter than that achieved by PR, while the average
user response time during the normal period in the
8-disk RAID5 set is 20.1% longer than that in the 12-disk
RAID5 set due to the reduced access parallelism of the
former. In summary, we can conclude that WorkOut is
cost-effective in both large-scale and small-scale storage
systems.

Memory overhead. To prevent data loss, WorkOut 
uses non-volatile memory to store D_Table, thus incur- 
ring extra memory overhead. The amount of memory 
consumed is largest when the minimum reconstruction 
bandwidth is set to 1MB/s, since in this case the reconstruction
time is the longest and the amount of redirected
data is the largest. In the above experiments on
the RAID5 set with individual disk capacity of 10GB, the
maximum memory overheads are 0.14MB, 0.62MB and
1.69MB for the Fin1, Fin2 and Web traces, respectively.
However, the memory overhead incurred by WorkOut is
only temporary and will be removed after the reclaim
process completes. With the rapid increase in memory
size and decrease in cost of non-volatile memories, this
memory overhead is arguably reasonable and acceptable
to end users.

Implementation overhead. WorkOut adds or modifies 780
lines of code in the source code of
the Linux software RAID (MD), with most lines
added to md.c and raidx.c and 37 lines of data structure
code added to md_k.h and raidx.h. Since most of
the added code is independent of the underlying RAID
layout, it can easily be shared by different RAID levels.
Moreover, the added code is independent of the reconstruction
module, so it is easy to adapt the code for
use with other background support RAID tasks. All that
needs to be done is to modify the corresponding flag that
triggers WorkOut. Because the WorkOut module is implemented
independently, it is portable to other software
RAID implementations in other operating systems.


5 Reliability Analysis


In this section, we adopt the MTTDL metric to estimate
the reliability of WorkOut. We assume that disk failures
are independent events following an exponential distribution
of rate μ, and repairs follow an exponential distribution
of rate ν. For simplicity, we do not consider
latent sector errors in the system model.

According to the conclusion about the reliability of
RAID5 [50], MTTDL of an 8-disk RAID5 set achieved
by PR and PRO is:

$$\mathrm{MTTDL}_{\mathrm{RAID5\text{-}8}} = \frac{15\mu + \nu}{56\mu^{2}} \qquad (1)$$
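Equation (1) is consistent with the standard two-state Markov model of a RAID5 set, sketched here as a reading aid rather than as the derivation used in [50]. Write μ for the per-disk failure rate and ν for the repair rate, and let T_0 and T_1 (symbols introduced only for this sketch) be the expected times to data loss starting with all 8 disks operational and with one disk failed, respectively:

```latex
\begin{align*}
T_0 &= \frac{1}{8\mu} + T_1, \\
T_1 &= \frac{1}{7\mu + \nu} + \frac{\nu}{7\mu + \nu}\, T_0, \\
\frac{7\mu}{7\mu + \nu}\, T_0 &= \frac{1}{8\mu} + \frac{1}{7\mu + \nu}
  = \frac{15\mu + \nu}{8\mu\,(7\mu + \nu)}, \\
T_0 &= \frac{15\mu + \nu}{56\mu^{2}} .
\end{align*}
```

The first line waits for one of 8 disks to fail; the second accounts for either a second failure among the 7 survivors (data loss) or a completed repair (return to the fully operational state).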

Figure 12 shows the state transition diagram for a
WorkOut-enabled storage system configuration consisting
of an 8-disk data RAID5 set and a 4-disk surrogate
RAID5 set. Note that by design WorkOut always reclaims
the redirected write data from the surrogate RAID
set upon a surrogate disk failure. Once the reclaim process
is completed there is no more valid data of the degraded
RAID set on the surrogate RAID set. Therefore,
there is no need to reconstruct data on the failed disk of
the surrogate RAID set. This means that the degraded
surrogate RAID set can be recovered by simply replacing
the failed disk with a new one, resulting in a newly
recovered operational 4-disk surrogate RAID5 set ready
to be used by the 8-disk degraded data RAID5 set. As
a result, the state transition diagram only shows the reclaim
process but not the reconstruction process of the
4-disk surrogate RAID5 set.

State <0> represents the normal state of the system
when its 8 data disks are all operational. A failure of any
of the 8 data disks would bring the system to state <1>
and a subsequent failure of any of the remaining 7 data
disks would result in data loss. A failure of any of the 4
surrogate disks in state <0> does not affect the system
reliability of the 8-disk RAID5 set as long as the redirected
write data on the former is reclaimed back to the
latter and thus it is omitted from the state transition diagram.
In state <1>, a failure of any of the 4 surrogate
disks would bring the system to state <2>. A second
failure in either the 8-disk data RAID5 set (1 out of 7)
or the 4-disk surrogate RAID5 set (1 out of 3) in state
<2> would result in data loss. In state <2>, WorkOut
reclaims the redirected write data back to the 8-disk data
RAID set, which brings the system back to state <1>
and follows an exponential distribution of rate κ₁. This
transition implicitly assumes that, while redirected write
data is being reclaimed from the surrogate set to the data
set, the reconstruction process on the latter is temporarily
suspended. This simplifying assumption is justifiable
and will not affect the result noticeably since the reclaim
time is much shorter than the reconstruction time on the
8-disk data RAID set. Finishing the reconstruction process
of the 8-disk data RAID5 set would bring the system
from state <1> to state <3>, where the redirected write
data has not been reclaimed. Then finishing the reclaim
process would bring it back to state <0>, which follows
an exponential distribution of rate κ₂. In state <3>, a
failure of any of the 8 data disks would bring the system
to state <1>, and a failure of any of the 4 surrogate
disks would bring the system to state <4>, where the
redirected write data is not protected by redundancy. In
state <4>, WorkOut also reclaims the redirected write
data back to the 8-disk data RAID set, which brings the
system back to state <0> and follows an exponential
distribution of rate κ₃. In state <4>, a failure of any
of the 8 data disks would bring the system to state <2>
and a second disk failure in the 4-disk surrogate RAID5
set would result in data loss.

Figure 12: State transition diagram for a WorkOut-enabled
storage system configuration consisting of an 8-disk
RAID5 set and a 4-disk surrogate RAID5 set. Note:
The 4-disk surrogate RAID5 set does not need to reconstruct
the data on the failed disk as long as the redirected
write data in the surrogate RAID5 set is reclaimed back
to the 8-disk degraded data RAID set.

Since κ₁, κ₂ and κ₃ all represent the rate at which the
redirected write data is reclaimed, it is reasonable to assume
that they are equal to a fixed reclaim rate κ, since
the amount of redirected write data should be roughly the
same and the rate of transferring this data should also be
the same under reasonable circumstances for all the three
cases. Since the expression for MTTDL of Figure 12 is
too complex (ratio of two large polynomials) to be displayed
here, we present the computed values of MTTDL
in the following figure instead.

Figure 13: Comparisons of the mean times to data
loss. Note: The normalized baselines are the MTTDLs
achieved by PR driven by the three traces respectively.
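Although the closed-form expression is unwieldy, the MTTDL of such a chain can be evaluated numerically by solving a small linear system for the expected time to absorption. The sketch below is one way to do this, not the authors' computation: the transition table is a reading of the textual description of Figure 12 (the figure itself is not reproduced here), all three reclaim rates are set to a single value, and the helper names are hypothetical.

```python
LOSS = "loss"   # absorbing data-loss state


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def mean_time_to_loss(rates, n_states, start=0):
    """rates maps (src, dst) to a transition rate; dst may be LOSS."""
    a = [[0.0] * n_states for _ in range(n_states)]
    for (src, dst), rate in rates.items():
        a[src][src] += rate          # total outflow from src
        if dst != LOSS:
            a[src][dst] -= rate      # flow into another transient state
    return solve(a, [1.0] * n_states)[start]


def workout_rates(mu, nu, kappa):
    """Transitions as read from the description of Figure 12 (an assumption)."""
    return {
        (0, 1): 8 * mu,                      # data disk fails
        (1, LOSS): 7 * mu,                   # second data disk fails
        (1, 2): 4 * mu,                      # surrogate disk fails
        (1, 3): nu,                          # reconstruction completes
        (2, LOSS): 10 * mu,                  # 7 data or 3 surrogate disks
        (2, 1): kappa,                       # reclaim completes
        (3, 1): 8 * mu,
        (3, 4): 4 * mu,
        (3, 0): kappa,
        (4, 0): kappa,
        (4, 2): 8 * mu,
        (4, LOSS): 3 * mu,
    }
```

As a sanity check, applying `mean_time_to_loss` to the two-state RAID5 chain `{(0, 1): 8*mu, (1, 0): nu, (1, LOSS): 7*mu}` reproduces (15μ + ν)/(56μ²). For the WorkOut chain with the same μ and ν, the result stays close to that value, so under this reading the gain in MTTDL comes mainly from the larger ν that a shorter reconstruction time provides.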

Figure 13 plots comparisons of MTTDLs achieved by
PR, PRO and WorkOut, which are normalized to the
MTTDLs achieved by PR driven by the three traces respectively.
The disk failure rate μ is assumed to be one
failure every one hundred thousand hours, which is a
conservative estimate relative to the values quoted by disk
manufacturers. Disk repair times, when the individual disk
capacity is 250GB, are 25 times the results listed in Table 3,
where the capacity of each disk is limited to 10GB.
κ is assumed to be equal to the corresponding ν, which
is actually an overestimate. From Figure 13, one can see
that WorkOut increases MTTDL and improves reliability
by decreasing MTTR, especially for the write-intensive
trace (i.e., Fin1). Moreover, if we alter κ to
be several times smaller or larger than ν, the black bar
remains almost unchanged, suggesting that the reclaim
time does not affect the reliability of the RAID system
and thus can be excluded from the reconstruction time.
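To see how a shorter repair time translates into MTTDL, the stated parameters can be plugged into the RAID5 expression of Equation (1), MTTDL = (15μ + ν)/(56μ²). This is a rough illustration only: it applies Equation (1) to both cases instead of the full model of Figure 12, it uses the Fin1 reconstruction times for PR quoted in Section 4.3, and it is not a reproduction of Figure 13. The function names are hypothetical.

```python
MU = 1 / 100_000   # disk failures per hour, from the text
SCALE = 25         # 250GB / 10GB, from the text


def repair_rate(recon_seconds_10gb):
    """Repair rate per hour for a 250GB disk, scaled from a 10GB measurement."""
    mttr_hours = SCALE * recon_seconds_10gb / 3600
    return 1 / mttr_hours


def mttdl_raid5_8(mu, nu):
    """Equation (1): MTTDL of an 8-disk RAID5 set, in hours."""
    return (15 * mu + nu) / (56 * mu ** 2)


nu_pr = repair_rate(1121.8)        # PR, Fin1
nu_workout = repair_rate(203.1)    # WorkOut-powered PR, Fin1
ratio = mttdl_raid5_8(MU, nu_workout) / mttdl_raid5_8(MU, nu_pr)
```

Because ν is several orders of magnitude larger than 15μ here, the MTTDL ratio is nearly the same as the ratio of the reconstruction times, about 5.5 for these inputs.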


6 Related Work 


Reconstruction algorithms and task scheduling. A 
large number of different approaches for improving re- 
construction performance have been studied. Some of 
these approaches focus on improved RAID reconstruc- 
tion algorithms, such as DOR (Disk-Oriented Recon- 
struction) [12], PR [22], PRO [41] and others [3, 13, 14, 
20, 38, 47, 48, 49]. Other approaches focus on better data 
layout in a RAID set [13, 47, 49]. Moreover, many task 
scheduling techniques have been proposed to optimize 
the background applications [25, 40, 44]. 

While all of the above algorithms focus on improving
performance by optimizing the organization of work
within a single RAID set, our work takes a different approach.
The goal behind WorkOut is to increase performance
during reconstruction by outsourcing I/O workloads
away from the degraded RAID set. Importantly,
WorkOut is orthogonal to and can further improve the
above techniques.

Data migration. Our study is related in spirit to write 
off-loading [27, 28] and data migration [2, 19, 21, 24, 46] 
techniques, but with distinctively different characteris- 
tics. Write off-loading [27] redirects writes from one 
volume to another, to prolong the idle period for one vol- 
ume allowing the system to spin down disks for saving 
energy. Similar in spirit, Everest [28] off-loads writes 
from overloaded volumes to lightly loaded ones to im- 
prove performance during peaks. 

Data migration [24] moves data from one storage de- 
vice to another, e.g., for the purpose of load balancing 
(or load concentration), failure recovery, or system expansion.
Data migration has been used in the context
of improving energy efficiency in PARAID [46], improving
performance by data reallocation (e.g., in the products of 
EMC’s Symmetrix family [2]), for read request offload- 
ing in Cuckoo [21] and for user-centric data migration in 
networked storage systems [19]. 

In contrast to write off-loading and data migration,
WorkOut improves reconstruction performance by temporarily
redirecting writes and popular reads during reconstruction
and reclaiming the redirected write data
back to the newly recovered RAID set after the reconstruction
process completes.


7 Future Work 


WorkOut is an ongoing research project and we are cur- 
rently exploring several directions for future work. 

Extendibility. In addition to RAID reconstruction 
and re-synchronization, other background support RAID 
tasks, such as disk scrubbing and block-level backup 
and snapshot, could benefit from WorkOut. We plan to 
conduct detailed experiments to measure the impact of 
WorkOut on these tasks. 

Flexibility. In the current implementation, we config- 
ure a reserved space instead of the free space on a live 
RAID set as the surrogate set, which can be impractical 
and inflexible. Utilizing the free space on a live RAID set 
at the file system level is complicated as the file system 
must be engaged to discover, assign, protect and manage 
the free space [37]. To make WorkOut more transpar- 
ent to the file system, and more effectively utilize the 
free space on a live surrogate RAID set, it would be de- 
sirable for WorkOut to obtain the liveness information at 
the block level. We will explore the live block techniques 
and apply them in WorkOut to improve its performance. 


8 Conclusion 


In this paper, to significantly boost RAID reconstruction
performance, we propose WorkOut (I/O Workload
Outsourcing), which outsources a significant amount of user
I/O requests away from the degraded RAID set to a surrogate
RAID set during reconstruction. We present a
lightweight prototype of WorkOut implemented in the
Linux software RAID. In a detailed experimental evaluation,
we demonstrate that, compared with the existing
reconstruction algorithms PR and PRO, WorkOut significantly
speeds up the reconstruction time and average
user response time simultaneously. Moreover, we provide
insights and guidance for storage system designers
and administrators by exploring three WorkOut design
options and their trade-offs in device overhead, performance,
reliability and maintainability. Importantly,
we demonstrate how WorkOut can be easily deployed to
improve the performance of other background support
RAID tasks such as re-synchronization.


Abstract


Over the past five years, large-scale storage installations 
have required fault-protection beyond RAID-5, leading 
to a flurry of research on and development of erasure 
codes for multiple disk failures. Numerous open-source 
implementations of various coding techniques are avail- 
able to the general public. In this paper, we perform 
a head-to-head comparison of these implementations in 
encoding and decoding scenarios. Our goals are to com- 
pare codes and implementations, to discern whether the- 
ory matches practice, and to demonstrate how parameter 
selection, especially as it concerns memory, has a signifi- 
cant impact on a code’s performance. Additional benefits 
are to give storage system designers an idea of what to 
expect in terms of coding performance when designing 
their storage systems, and to identify the places where 
further erasure coding research can have the most im- 
pact. 


1 Introduction 


In recent years, erasure codes have moved to the fore 
to prevent data loss in storage systems composed of 
multiple disks. Storage companies such as Allmy- 
data [1], Cleversafe [7], Data Domain [36], Network Ap- 
pliance [22] and Panasas [32] are all delivering prod- 
ucts that use erasure codes beyond RAID-5 for data 
availability. Academic projects such as LoCI [3],
Oceanstore [29], and Pergamum [31] are doing the same. 
And of equal importance, major technology corporations 
such as Hewlett Packard [34], IBM [12, 13] and Mi- 
crosoft [15, 16] are performing active research on erasure 
codes for storage systems. 

Along with proprietary implementations of erasure 
codes, there have been numerous open source implemen- 
tations of a variety of erasure codes that are available 
for download [7, 19, 23, 26, 33]. The intent of most 
of these projects is to provide storage system developers 
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with high quality tools. As such, there is a need to un- 
derstand how these codes and implementations perform. 

In this paper, we compare the encoding and decoding 
performance of five open-source implementations of five 
different types of erasure codes: Classic Reed-Solomon 
codes [28], Cauchy Reed-Solomon codes [6], EVEN- 
ODD [4], Row Diagonal Parity (RDP) [8] and Minimal 
Density RAID-6 codes [5, 24, 25]. The latter three codes 
are specific to RAID-6 systems that can tolerate exactly 
two failures. Our exploration seeks not only to compare 
codes, but also to understand which features and param- 
eters lead to good coding performance. 

We summarize the main results as follows: 


• The special-purpose RAID-6 codes vastly outperform
their general-purpose counterparts. RDP performs
the best of these by a narrow margin.

• Cauchy Reed-Solomon coding outperforms classic
Reed-Solomon coding significantly, as long as attention
is paid to generating good encoding matrices.

• An optimization called Code-Specific Hybrid Reconstruction [14]
is necessary to achieve good decoding
speeds in many of the codes.

• Parameter selection can have a huge impact on how
well an implementation performs. Not only must
the number of computational operations be considered,
but also how the code interacts with the memory
hierarchy, especially the caches.

• There is a need to achieve the levels of improvement
that the RAID-6 codes show for higher numbers of
failures.


Of the five libraries tested, Zfec [33] implemented the 
fastest classic Reed-Solomon coding, and Jerasure [26] 
implemented the fastest versions of the others. 
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2 Nomenclature and Erasure Codes 


It is an unfortunate consequence of the history of era- 
sure coding research that there is no unified nomencla- 
ture for erasure coding. We borrow terminology mostly 
from Hafner et al [14], but try to conform to more classic 
coding terminology (e.g. [5, 21]) when appropriate. 

Our storage system is composed of an array of n 
disks, each of which is the same size. Of these n disks, k 
of them hold data and the remaining m hold coding in- 
formation, often termed parity, which is calculated from 
the data. We label the data disks $D_0, \ldots, D_{k-1}$ and the
parity disks $C_0, \ldots, C_{m-1}$. A typical system is pictured in Figure 1.
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Figure 1: A typical storage system with erasure coding. 


We are concerned with Maximum Distance Separa- 
ble (MDS) codes, which have the property that if any m 
disks fail, the original data may be reconstructed [21]. 
When encoding, one partitions each disk into strips of 
a fixed size. Each parity strip is encoded using one 
strip from each data disk, and the collection of k + m 
strips that encode together is called a stripe. Thus, as 
in Figure 1, one may view each disk as a collection of 
strips, and one may view the entire system as a collec- 
tion of stripes. Stripes are each encoded independently, 
and therefore if one desires to rotate the data and parity 
among the n disks for load balancing, one may do so by 
switching the disks’ identities for each stripe. 


2.1 Reed-Solomon (RS) Codes 


Reed-Solomon codes [28] have the longest history. The
strip unit is a w-bit word, where w must be large enough
that $n \le 2^w + 1$. So that words may be manipulated
efficiently, w is typically constrained so that words fall
on machine word boundaries: $w \in \{8, 16, 32, 64\}$. However,
as long as $n \le 2^w + 1$, the value of w may be
chosen at the discretion of the user. Most implementations
choose w = 8, since their systems contain fewer
than 256 disks, and w = 8 performs the best. Reed-Solomon
codes treat each word as a number between 0
and $2^w - 1$, and operate on these numbers with Galois
Field arithmetic ($GF(2^w)$), which defines addition, multiplication
and division on these words such that the system
is closed and well-behaved [21].

The act of encoding with Reed-Solomon codes is sim- 
ple linear algebra. A Generator Matrix is constructed 
from a Vandermonde matrix, and this matrix is multiplied 
by the k data words to create a codeword composed of
the k data and m coding words. We picture the process 
in Figure 2 (note, we draw the transpose of the Generator 
Matrix to make the picture clearer).

Figure 2: Reed-Solomon coding for k = 4 and m = 2.
Each element is a number between 0 and $2^w - 1$.


When disks fail, one decodes by deleting rows of $G^T$,
inverting it, and multiplying the inverse by the surviving
words. This process is equivalent to solving a set of independent
linear equations. The construction of $G^T$ from
the Vandermonde matrix ensures that the matrix inversion
is always successful.

In $GF(2^w)$, addition is equivalent to bitwise
exclusive-or (XOR), and multiplication is more com- 
plex, typically implemented with multiplication tables 
or discrete logarithm tables [11]. For this reason, Reed- 
Solomon codes are considered expensive. There are sev- 
eral open-source implementations of RS coding, which 
we detail in Section 3. 
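
As a concrete illustration of the discrete-logarithm approach, the following Python sketch builds log and antilog tables for GF(2^8) and uses them to multiply and divide. The primitive polynomial 0x11D and all function names are assumptions chosen for illustration; they are not taken from any of the libraries discussed in this paper.

```python
# Sketch: GF(2^8) arithmetic with discrete-log tables.
# Assumption: primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D).
W = 8
PRIM_POLY = 0x11D
FIELD_SIZE = 1 << W            # 256 field elements
ORDER = FIELD_SIZE - 1         # 255 nonzero elements

# exp_table[i] = 2^i in GF(2^8); log_table[x] = i such that 2^i = x.
# exp_table is stored twice over so that sums of two logs need no modulo.
exp_table = [0] * (2 * ORDER)
log_table = [0] * FIELD_SIZE
x = 1
for i in range(ORDER):
    exp_table[i] = x
    exp_table[i + ORDER] = x
    log_table[x] = i
    x <<= 1
    if x & FIELD_SIZE:         # overflow past 8 bits: reduce by the polynomial
        x ^= PRIM_POLY


def gf_add(a, b):
    """Addition (and subtraction) is bitwise XOR."""
    return a ^ b


def gf_mul(a, b):
    """Multiply by adding discrete logarithms."""
    if a == 0 or b == 0:
        return 0
    return exp_table[log_table[a] + log_table[b]]


def gf_div(a, b):
    """Divide by subtracting discrete logarithms."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^8)")
    if a == 0:
        return 0
    return exp_table[log_table[a] - log_table[b] + ORDER]
```

Each multiplication costs two table lookups, an integer addition and a third lookup, compared with a single XOR for addition, which is one way to see why RS coding is regarded as expensive.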


2.2 Cauchy Reed-Solomon (CRS) Codes 


CRS codes [6] modify RS codes in two ways. First, 
they employ a different construction of the Generator 
matrix using Cauchy matrices instead of Vandermonde 
matrices. Second, they eliminate the expensive multipli- 
cations of RS codes by converting them to extra XOR 
operations. Note, this second modification can apply to 
Vandermonde-based RS codes as well. This modification
transforms $G^T$ from an $n \times k$ matrix of w-bit words
to a $wn \times wk$ matrix of bits. As with RS coding, w must
be selected so that $n \le 2^w + 1$.

Instead of operating on single words, CRS coding op- 
erates on entire strips. In particular, strips are partitioned 
into w packets, and these packets may be large. The act 
of coding now involves only XOR operations — a coding 
packet is constructed as the XOR of all data packets that 




have a one bit in the coding packet’s row of $G^T$. The
process is depicted in Figure 3, which illustrates how the
last coding packet is created as the XOR of the six data
packets identified by the last row of $G^T$.

Figure 3: CRS example for k = 4 and m = 2.


To make XORs efficient, the packet size must be a
multiple of the machine’s word size. The strip size is
therefore equal to w times the packet size. Since w no
longer relates to the machine word sizes, w is not constrained
to {8, 16, 32, 64}; instead, any value of w may be
selected as long as $n \le 2^w$.
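
The conversion from a field element to a bit matrix, and the XOR-only packet encoding it enables, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: w = 4, the primitive polynomial x^4 + x + 1, and the helper names `bit_matrix`, `encode_packet` and `encode_strip` are all hypothetical and not drawn from any library in this paper.

```python
# Sketch: turning multiplication by a GF(2^w) element into XORs of packets.
# Assumptions: w = 4 and primitive polynomial x^4 + x + 1 (0x13).
W = 4
PRIM_POLY = 0x13


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^W)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & (1 << W):
            a ^= PRIM_POLY
        b >>= 1
    return result


def bit_matrix(e):
    """W x W bit matrix for 'multiply by e': column j holds the bits of e * 2^j."""
    cols = [gf_mul(e, 1 << j) for j in range(W)]
    return [[(cols[j] >> i) & 1 for j in range(W)] for i in range(W)]


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def encode_packet(row_bits, data_packets):
    """XOR together the data packets whose bit in row_bits is one."""
    out = bytes(len(data_packets[0]))
    for bit, packet in zip(row_bits, data_packets):
        if bit:
            out = xor_bytes(out, packet)
    return out


def encode_strip(e, data_packets):
    """Apply the bit matrix of e to a strip of W data packets."""
    return [encode_packet(row, data_packets) for row in bit_matrix(e)]
```

The number of XORs performed by `encode_strip` is governed by the number of ones in the bit matrix, so matrices with fewer ones encode faster.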

Decoding in CRS is analogous to RS coding: all
rows of $G^T$ corresponding to failed packets are deleted,
and the matrix is inverted and employed to recalculate
the lost data.

Since the performance of CRS coding is directly related
to the number of ones in $G^T$, there has been research
on constructing Cauchy matrices that have fewer
ones than the original CRS constructions [27]. The Jerasure
library [26] uses additional matrix transformations
to improve these matrices further. Additionally, in the
restricted case when m = 2, the Jerasure library uses results
of a previous enumeration of all Cauchy matrices to
employ provably optimal matrices for all $w \le 32$ [26].


2.3 EVENODD and RDP


EVENODD [4] and RDP [8] are two codes developed for 
the special case of RAID-6, which is when m = 2. Con- 
ventionally in RAID-6, the first parity drive is labeled P, 
and the second is labeled Q. The P drive is equivalent to 
the parity drive in a RAID-4 system, and the Q drive is 
defined by parity equations that have distinct patterns. 
Although their original specifications use different 
terms, EVENODD and RDP fit the same paradigm as 
CRS coding, with strips being composed of w packets. 
In EVENODD, w is constrained such that $k + 1 \le w$
and $w + 1$ is a prime number. In RDP, $w + 1$ must be prime
and $k \le w$. Both codes perform the best when $(w - k)$ is
minimized. In particular, RDP achieves optimal encoding
and decoding performance of $(k - 1)$ XOR operations
per coding word when $k = w$ or $k + 1 = w$. Both codes’
performance decreases as $(w - k)$ increases.




2.4 Minimal Density RAID-6 Codes 


If we encode using a Generator bit-matrix for RAID-6,
the matrix is quite constrained. In particular, the
first kw rows of $G^T$ compose an identity matrix, and in
order for the P drive to be straight parity, the next w
rows must contain k identity matrices. The only flexibility
in a RAID-6 specification is the composition of
the last w rows. In [5], Blaum and Roth demonstrate
that when $k \le w$, these remaining w rows must have
at least $kw + k - 1$ ones for the code to be MDS. We
term MDS matrices that achieve this lower bound Minimal
Density codes.

There are three different constructions of Minimal 
Density codes for different values of w: 


• Blaum-Roth codes when w + 1 is prime [5].

• Liberation codes when w is prime [25].

• The Liber8tion code when w = 8 [24].


These codes share the same performance characteristics.
They encode with $(k - 1) + \frac{k-1}{2w}$ XOR operations
per coding word. Thus, they perform better when w
is large, achieving asymptotic optimality as $w \to \infty$.
Their decoding performance is slightly worse, and requires
a technique called Code-Specific Hybrid Reconstruction [14]
to achieve near-optimal performance [25].
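
The per-word encoding cost can be checked from the ones counts of the bit-matrix rows. The short derivation below is a sketch that assumes a row with z ones costs z − 1 XORs and that no intermediate results are shared between rows; the P rows each contain k ones, and the Q rows together contain the minimal kw + k − 1 ones.

```latex
\begin{align*}
\text{XORs for the } w \text{ rows of } P &= w(k-1) \\
\text{XORs for the } w \text{ rows of } Q &= (kw + k - 1) - w \\
\text{Total over the } 2w \text{ coding words} &= 2w(k-1) + (k-1) \\
\text{XORs per coding word} &= (k-1) + \frac{k-1}{2w}
\end{align*}
```

The second term shrinks as w grows, so the cost approaches the k − 1 XORs per coding word that is optimal.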

The Minimal Density codes also achieve near-optimal 
updating performance when individual pieces of data are 
modified [27]. This performance is significantly better 
than EVENODD and RDP, which are worse by a factor 
of roughly 1.5 [25]. 


2.5 Anvin’s RAID-6 Optimization 


In 2007, Anvin posted an optimization of RS encoding 
for RAID-6 [2]. For this optimization, the row of $G^T$
corresponding to the P drive contains all ones, so that
the P drive may be parity. The row corresponding to
the Q drive contains the number $2^i$ in $GF(2^w)$ in column
i (zero-indexed), so that the contents of the Q drive
may be calculated by successively XOR-ing drive i’s
data into the Q drive and multiplying that sum by two.
Since multiplication by two may be implemented much
faster than general multiplication in $GF(2^w)$, this optimizes
the performance of encoding over standard RS
implementations. Decoding remains unoptimized.
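
The structure of this optimization can be sketched in a few lines of Python. The sketch assumes w = 8 and the primitive polynomial 0x11D; the function names are hypothetical, and the packed variant only illustrates the idea of multiplying several bytes by two in one machine word rather than reproducing any particular implementation.

```python
# Sketch: P and Q computation with repeated multiplication by two.
# Assumptions: w = 8, primitive polynomial 0x11D, one byte per data drive.
PRIM_POLY = 0x11D


def gf_mul2(a):
    """Multiply one GF(2^8) element by two: shift, then reduce on overflow."""
    a <<= 1
    if a & 0x100:
        a ^= PRIM_POLY
    return a


def gf_mul(a, b):
    """General shift-and-add multiplication, used only as a reference."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = gf_mul2(a)
        b >>= 1
    return result


def encode_pq(data_words):
    """Return (P, Q), where Q is the sum over i of 2^i * data_words[i]."""
    p = 0
    q = 0
    # Horner's rule: walk from the last drive to the first, doubling the
    # running sum before XOR-ing in the next drive's word.
    for d in reversed(data_words):
        p ^= d
        q = gf_mul2(q) ^ d
    return p, q


def q_reference(data_words):
    """Q computed with general multiplications, for comparison."""
    q, coeff = 0, 1
    for d in data_words:
        q ^= gf_mul(coeff, d)
        coeff = gf_mul2(coeff)
    return q


def gf_mul2_packed(v):
    """Multiply four GF(2^8) elements packed in a 32-bit integer by two."""
    high = v & 0x80808080               # bytes that will overflow
    shifted = (v & 0x7F7F7F7F) << 1     # shift each byte without carry-over
    return shifted ^ ((high >> 7) * 0x1D)
```

Because `encode_pq` needs only XORs and doublings, the general multiplication tables are never consulted during encoding.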


3 Open Source Libraries


We test five open source erasure coding libraries. These 
are all freely available libraries from various sources 
on the Internet, and range from brief proofs of concept 
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(e.g. Luby) to tuned and supported code intended for use 
in real systems (e.g. Zfec). We also tried the Schifra open 
source library [23], which is free but without documen- 
tation. We were unable to implement an encoder and 
decoder to perform a satisfactory comparison with the 
others. We present them chronologically. 

Luby: CRS coding was developed at the ICSI lab in 
Berkeley, CA in the mid 1990’s [6]. The authors released 
a C version of their codes in 1997, which is available 
from ICSI’s web site [19]. The library supports all set- 
tings of k, m, w and packet sizes. The matrices employ 
the original constructions from [6], which are not opti- 
mized to minimize the number of ones. 

Zfec: The Zfec library for erasure coding has been in 
development since 2007, but its roots have been around 
for over a decade. Zfec is built on top of a RS coding 
library developed for reliable multicast by Rizzo [30]. 
That library was based on previous work by Karn et 
al [18], and has seen wide use and tuning. Zfec is based 
on Vandermonde matrices when w = 8. The latest ver- 
sion (1.4.0) was posted in January, 2008 [33]. The library 
is programmable, portable and actively supported by the 
author. It includes command-line tools and APIs in C, 
Python and Haskell. 

Jerasure: Jerasure is a C library released in 2007 
that supports a wide variety of erasure codes, including 
RS coding, CRS coding, general Generator matrix and 
bit-matrix coding, and Minimal Density RAID-6 cod- 
ing [26]. RS coding may be based on Vandermonde or 
Cauchy matrices, and w may be 8, 16 or 32. Anvin’s 
optimization is included for RAID-6 applications. For 
CRS coding, Jerasure employs provably optimal encod- 
ing matrices for RAID-6, and constructs optimized ma- 
trices for larger values of m. Additionally, the three Min- 
imal Density RAID-6 codes are supported. To improve 
performance of the bit-matrix codes, especially the de- 
coding performance, the Code-Specific Hybrid Recon- 
struction optimization [14] is included. Jerasure is re- 
leased under the GNU LGPL. 

Cleversafe: In May, 2008, Cleversafe exported the 
first open source version of its dispersed storage sys- 
tem [7]. Written entirely in Java, it supports the same 
API as Cleversafe’s proprietary system, which is notable 
as one of the first commercial distributed storage systems 
to implement availability beyond RAID-6. For this paper,
we obtained a version containing just the erasure
coding part of the open source distribution. It is based on 
Luby’s original CRS implementation [19] with w = 8. 

EVENODD/RDP: There are no open source versions 
of EVENODD or RDP coding. However, RDP may be 
implemented as a bit-matrix code, which, when com- 
bined with Code-Specific Hybrid Reconstruction yields 
the same performance as the original specification of the 
code [16]. EVENODD may also be implemented with a 
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bit-matrix whose operations may be scheduled to achieve 
the code’s original performance [16]. We use these ob- 
servations to implement both codes as bit-matrices with 
tuned schedules in Jerasure. Since EVENODD and RDP 
codes are patented, this implementation is not available 
to the public, as its sole intent is for performance com- 
parison. 


4 Encoding Experiment 


We perform two sets of experiments — one for encoding 
and one for decoding. For the encoding experiment, we 
seek to measure the performance of taking a large data 
file and splitting and encoding it into n = k + m pieces, 
each of which will reside on a different disk, making the 
system tolerant to up to m disk failures. Our encoder 
thus reads a data file, encodes it, and writes it to k + 
m data/coding files, measuring the performance of the 
encoding operations. 










































Figure 4: The encoder utilizes a data buffer and a coding 
buffer to encode a large file in stages. 
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Since memory utilization is a concern, and since large 
files exceed the capacity of most computers’ memo- 
ries, our encoder employs two fixed-size buffers, a Data 
Buffer partitioned into k blocks and a Coding Buffer par- 
titioned into m blocks. The encoder reads an entire data 
buffer’s worth of data from the big file, encodes it into the 
coding buffer and then writes the contents of both buffers 
to k + m separate files. It repeats this process until the 
file is totally encoded, recording both the total time and 
the encoding time. The high level process is pictured in 
Figure 4. 

The blocks of the buffer are each partitioned into s
strips, and each strip is partitioned either into words
of size w (RS coding, where $w \in \{8, 16, 32, 64\}$),
or into w packets of a fixed size PS (all other codes;
recall Figure 3). To be specific, each block $D_i$
(and $C_j$) is partitioned into strips $DS_{i,0}, \ldots, DS_{i,s-1}$
(and $CS_{j,0}, \ldots, CS_{j,s-1}$), each of size $wPS$. Thus,
the data and coding buffer sizes are dependent on
the various parameters. Specifically, the data buffer
size equals $(kswPS)$ and the coding buffer size
equals $(mswPS)$.

Encoding is done on a stripe-by-stripe basis. First, 
the data strips $DS_{0,0}, \ldots, DS_{k-1,0}$ are encoded into the
coding strips $CS_{0,0}, \ldots, CS_{m-1,0}$. This completes the
encoding of stripe 0, pictured in Figure 5. Each of the s 
stripes is successively encoded in this manner. 
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Thus, there are multiple parameters that the encoder 
allows the user to set. These are k, m, w (subject to the 
code’s constraints), s and PS. When we mention setting 
the buffer size below, we are referring to the size of the 
data buffer, which is (kswPS). 
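
The relationship between these parameters and the buffer sizes can be written down directly. The following Python sketch uses hypothetical function names, and the `target_bytes` argument is an assumption introduced here to show how s might be chosen for a desired data buffer size.

```python
# Sketch: buffer sizes implied by the encoder parameters k, m, w, s and PS.
def buffer_sizes(k, m, w, s, packet_size):
    """Return strip, data buffer and coding buffer sizes in bytes."""
    strip = w * packet_size                 # one strip holds w packets
    return {
        "strip": strip,
        "data_buffer": k * s * strip,       # k blocks of s strips each
        "coding_buffer": m * s * strip,     # m blocks of s strips each
    }


def strips_for_target(k, w, packet_size, target_bytes):
    """Pick s so that the data buffer is close to target_bytes (at least 1)."""
    return max(1, round(target_bytes / (k * w * packet_size)))
```

For instance, with k = 6, m = 2, w = 6, s = 1 and a packet size of 2400 bytes, the strip is 14400 bytes, the data buffer 86400 bytes and the coding buffer 28800 bytes.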


4.1 Machines for Experimentation 


We employed two machines for experimentation. Nei- 
ther is exceptionally high-end, but each represents 
middle-range commodity processors, which should be 
able to encode and decode comfortably within the I/O 
speed limits of the fastest disks. The first is a Macbook 
with a 32-bit 2GHz Intel Core Duo processor, with 1GB 
of RAM, a L1 cache of 32KB and a L2 cache of 2MB. 
Although the machine has two cores, the encoder only 
utilizes one. The operating system is Mac OS X, version 
10.4.11, and the encoder is executed in user space while 
no other user programs are being executed. As a baseline,
we recorded a memcpy() speed of 6.13 GB/sec and
an XOR speed of 2.43 GB/sec.

The second machine is a Dell workstation with an Intel
Pentium 4 CPU running at 1.5GHz with 1GB of RAM,
an 8KB L1 cache and a 256KB L2 cache. The operating
system is Debian GNU Linux revision 2.6.8-2-686, and
the machine is a 32-bit machine. The memcpy() speed
is 2.92 GB/sec and the XOR speed is 1.32 GB/sec.


4.2 Encoding with Large Files 


Our intent was to measure the actual performance of en- 
coding a large video file. However, doing large amounts 
of I/O causes a great deal of variability in performance 
timings. We exemplify with Figure 6. The data is from 
the Macbook, where we use a 256 MB video file for 
input. The encoder works as described in Section 4 
with k = 10 and m = 6. However, we perform no real 
encoding. Instead we simply zero the bytes of the coding 
buffer before writing it to disk. In the figure, we modify 
the size of the data buffer from a small size of 64 KB to 
256 MB, the size of the video file itself.

Figure 6: Times to read a 256 MB video, perform a
dummy encoding when k = 10 and m = 6, and write
to 16 data/coding files.


In Figure 6, each data point is the result of ten runs 
executed in random order. A Tukey plot is given, which
has bars to the maximum and minimum values, a box en- 
compassing the first to the third quartile, hash marks at 
the median and a dot at the mean. While there is a clear 
trend toward improving performance as the data buffer 
grows to 128 MB, the variability in performance is colos- 
sal: between 15 and 20 seconds for many runs. Running 
Unix’s split utility on the file reveals similar variability. 

Because of this variability, the tests that follow remove 
the I/O from the encoder. Instead, we simulate reading 
by filling the buffer with random bytes, and we simulate 
writing by zeroing the buffers. This reduces the vari- 
ability of the runs tremendously — the results that follow 
are all averages of over 10 runs, whose maximum and 
minimum values differ by less than 0.5 percent. The 
encoder measures the times of all coding activities using
Unix’s gettimeofday(). To confirm that these times
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are accurate, we also subtracted the wall clock time of a 
dummy control from the wall clock time of the encoder, 
and the two matched to within one percent. 

Figure 6 suggests that the size of the data buffer can 
impact performance, although it is unclear whether the 
impact comes from memory effects or from the file sys- 
tem. To explore this, we performed a second set of tests 
that modify the size of the data buffer while performing 
a dummy encoding. We do not graph the results, but 
they show that with the I/O removed, the effects of mod- 
ifying the buffer size are negligible. Thus, in the results 
that follow, we maintain a data buffer size of roughly 100 
KB. Since actual buffer sizes depend on k, m, w and PS, 
they cannot be affixed to a constant value; instead, they 
are chosen to be in the ballpark of 100 KB. This is large 
enough to support efficient I/O, but not so large that it 
consumes all of a machine’s memory, since in real sys- 
tems the processors may be multitasking. 


4.3 Parameter Space 


We test four combinations of k and m, which we denote
by [k,m]. Two combinations are RAID-6 scenarios:
[6,2] and [14,2]. The other two represent 16-disk
stripes with more fault-tolerance: [12,4] and [10,6]. We 
chose these combinations because they represent values 
that are likely to be seen in actual usage. Although 
large and wide-area storage installations are composed 
of much larger numbers of disks, the stripe sizes tend to 
stay within this medium range, because the benefits of 
large stripe sizes show diminishing returns compared to 
the penalty of extra coding overhead in terms of encod- 
ing performance and memory use. For example, Clever- 
safe’s widely dispersed storage system uses [10,6] as its 
default [7]; Allmydata’s archival online backup system 
uses [3,7], and both Panasas [32] and Pergamum [31] re- 
port keeping their stripe sizes at or under 16. 

For each code and implementation, we test its perfor- 
mance by encoding a randomly generated file that is 1 
GB in size. We test all legal values of $w \le 32$. This
results in the following tests. 


• Zfec: RS coding, w = 8 for all combinations
of [k, m].

• Luby: CRS coding, $w \in \{4, \ldots, 12\}$ for all combinations
of [k, m], and w = 3 for [6,2].

• Cleversafe: CRS coding, w = 8 for all combinations
of [k, m].

• Jerasure:

  - RS coding, $w \in \{8, 16, 32\}$ for all combinations
    of [k, m]. Anvin’s optimization is included
    for the RAID-6 tests.

  - CRS coding, $w \in \{4, \ldots, 32\}$ for all combinations
    of [k, m], and w = 3 for [6,2].

  - Blaum-Roth codes, $w \in \{6, 10, 12\}$ for [6,2]
    and $w \in \{16, 18, 22, 28, 30\}$ for [6,2] and
    [14,2].

  - Liberation codes, $w \in \{7, 11, 13\}$ for [6,2]
    and $w \in \{17, 19, 23, 29, 31\}$ for [6,2] and
    [14,2].

  - The Liber8tion code, w = 8 for [6,2].

• EVENODD: Same parameters as Blaum-Roth codes
in Jerasure above.

• RDP: Same parameters as EVENODD.


4.4 Impact of the Packet Size 


Our experience with erasure coding led us to experi- 
ment first with modifying the packet sizes of the en- 
coder. There is a clear tradeoff: lower packet sizes have 
less tight XOR loops, but better cache behavior. Higher 
packet sizes perform XORs over larger regions, but cause 
more cache misses. To exemplify, consider Figure 7, 
which shows the performance of RDP on the [6,2] con- 
figuration when w = 6, on the Macbook. We test every 
packet size from 4 to 10000 and display the speed of en- 
coding. 
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Figure 7: The effect of modifying the packet size on RDP 
coding, k = 6, m = 2, w = 6 on the Macbook. 


We display two y-axes. On the left is the encoding 
speed. This is the size of the input file divided by the time 
spent encoding and is the most natural metric to plot. On 
the right, we normalize the encoding speed so that we 
may compare the performance of encoding across con- 
figurations. The normalized encoding speed is calculated 
as: 

$$\frac{(\text{Encoding Speed}) \cdot m(k-1)}{k} \qquad (1)$$

This is derived as follows. Let S be the file’s size and t
be the time to encode. The file is split and encoded
into m + k files, each of size $\frac{S}{k}$. The encoding process
itself creates $\frac{Sm}{k}$ bytes worth of coding data, and therefore
the speed per coding byte is $\frac{Sm}{kt}$. Optimal encoding
takes k − 1 XOR operations per coding drive [35]; therefore
we can normalize the speed by dividing the time
by k − 1, leaving us with $\frac{Sm(k-1)}{kt}$, or Equation 1, for
the normalized encoding speed.
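
The normalization follows directly from the derivation in the text: each file holds S/k bytes, the m coding files hold Sm/k bytes in total, and the time is divided by k − 1. As a small Python sketch, with a hypothetical function name:

```python
# Sketch: normalized encoding speed for a file of S bytes encoded in t seconds.
def normalized_encoding_speed(file_bytes, seconds, k, m):
    """Return the encoding speed scaled by m(k - 1)/k, in bytes per second."""
    encoding_speed = file_bytes / seconds       # S / t
    return encoding_speed * m * (k - 1) / k     # S m (k - 1) / (k t)
```

A code that needs exactly k − 1 XORs per coding word would then report a normalized speed close to the machine's raw XOR speed, which makes results comparable across [k,m] configurations.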

The shape of this curve is typical for all codes on both 
machines. In general, higher packet sizes perform bet- 
ter than lower ones; however there is a maximum perfor- 
mance point which is achieved when the code makes best 
use of the L1 cache. In this test, the optimal packet size 
is 2400 bytes, achieving a normalized encoding speed of 
2172 MB/sec. Unfortunately, this curve does not mono- 
tonically increase to nor decrease from its optimal value. 
Worse, there can be radical dips in performance between 
adjacent packet sizes, due to collisions between cache 
entries. For example, at packet sizes 7732, 7736 and 
7740, the normalized encoding speeds are 2133, 2066 
and 2129 MB/sec, respectively. We reiterate that each 
data point in our graphs represents over 10 runs, and the 
repetitions are consistent to within 0.5 percent. 
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Figure 8: The effect of modifying w on the best packet 
sizes found. 


We do not attempt to find the optimal packet sizes for 
each of the codes. Instead, we perform a search algo- 
rithm that works as follows. We test a region r of packet 
sizes by testing each packet size from r to r + 36 (packet 
sizes must be a multiple of 4). We set the region’s perfor- 
mance to be the average of the five best tests. To start our 
search, we test all regions that are powers of two from 64 
to 32K. We then iterate, finding the best region r, and 
then testing the two regions that are halfway between the 
two values of r that we have tested that are adjacent to r. 
We do this until there are no more regions to test, and select,
of all the packet sizes tested, the one that performed the best.
For example, the search for the RDP instance of Figure 7 
tested only 202 packet sizes (as opposed to 2500 to gen- 
erate Figure 7) to arrive at a packet size of 2588 bytes, 
which encodes at a normalized speed of 2164 MB/sec 
(0.3% worse than the best packet size of 2400 bytes). 
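
One possible reading of this search procedure is sketched below in Python. The stopping rule (stop when no untested midpoint region remains next to the best region), the rounding of midpoints down to a multiple of 4, and the `measure` callback that returns an encoding speed for a packet size are all assumptions made for illustration; the search actually used for the experiments may differ in these details.

```python
# Sketch: region-based search for a good packet size.
# measure(ps) is a caller-supplied function returning the speed at packet size ps.
def search_packet_size(measure, lo=64, hi=32 * 1024):
    samples = {}                        # packet size -> measured speed

    def sample(ps):
        if ps not in samples:
            samples[ps] = measure(ps)
        return samples[ps]

    def score(r):
        """Average of the five best speeds among packet sizes r, r+4, ..., r+36."""
        speeds = sorted((sample(ps) for ps in range(r, r + 40, 4)), reverse=True)
        return sum(speeds[:5]) / 5

    # Start with every region that is a power of two between lo and hi.
    regions = {}
    r = lo
    while r <= hi:
        regions[r] = score(r)
        r *= 2

    while True:
        tested = sorted(regions)
        best = max(tested, key=regions.get)
        i = tested.index(best)
        new_regions = []
        # Look halfway between the best region and each tested neighbour.
        for j in (i - 1, i + 1):
            if 0 <= j < len(tested):
                mid = (best + tested[j]) // 2 // 4 * 4
                if mid not in regions:
                    new_regions.append(mid)
        if not new_regions:
            break
        for r in new_regions:
            regions[r] = score(r)

    best_ps = max(samples, key=samples.get)
    return best_ps, len(samples)
```

The function returns the best packet size seen and the number of distinct packet sizes measured, so the cost of the search can be compared with an exhaustive sweep.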

One expects the optimal packet size to decrease 
as k, m and w increase, because each of these increases 
the stripe size. Thus smaller packets are necessary for 
most of the stripe to fit into cache. We explore this effect 
in Figure 8, where we show the best packet sizes found 
for different sets of codes - RDP, Minimum Density, and 
Jerasure’s CRS — in the two RAID-6 configurations. For 
each code, the larger value of k results in a smaller packet 
size, and as a rough trend, as w increases, the best packet 
size decreases. 


4.5 Overall Encoding Performance 


We now present the performance of each of the codes 
and implementations. In the codes that allow a packet 
size to be set, we select the best packet size from the 
above search. The results for the [6,2] configuration are 
in Figure 9. 

Although the graphs for both machines appear similar, 
there are interesting features of both. We concentrate first 
on the MacBook. The specialized RAID-6 codes outperform
all others, with RDP at w = 6
performing the best. This result is expected, as RDP
achieves optimal performance when k = w.

The performance of these codes is typically quantified 
by the number of XOR operations performed [5, 4, 8, 
25, 24]. To measure how well the number of XORs matches
actual performance, we present the number of gigabytes 
XOR’d by each code in Figure 10. 

On the MacBook, the number of XORs is an excellent 
indicator of performance, with a few exceptions (CRS 
codes for $w \in \{21, 22, 32\}$). As predicted by XOR
count, RDP’s performance suffers as w increases, while 
the Minimal Density codes show better performance. Of 
the three special-purpose RAID-6 codes, EVENODD 
performs the worst, although the margins are not large 
(the worst performing EVENODD encodes at 89% of the 
speed of the best RDP). 

The performance of Jerasure’s implementation of the
CRS codes is also excellent, although the choice of w
is very important. The number of ones in the CRS generator
matrices depends on the number of bits in the
Galois Field’s primitive polynomial. The polynomials
for $w \in \{8, 12, 13, 14, 16, 19, 24, 26, 27, 30, 32\}$ have
one more bit than the others, resulting in worse performance.
This is important, as $w \in \{8, 16, 32\}$ are very
natural choices since they allow strip sizes to be powers
of two.
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Figure 9: Encoding performance for [6,2]. 


Returning to Figure 9, the Luby and Cleversafe
implementations of CRS coding perform much worse 
than Jerasure. There are several reasons for this. First, 
they do not optimize the generator matrix in terms of 
number of ones, and thus perform many more XOR op- 
erations, from 3.2 GB of XORs when w = 3 to 13.5 
GB when w = 12. Second, both codes use a dense, 
bit-packed representation of the generator matrix, which 
means that they spend quite a bit of time performing bit 
operations to check matrix entries, many of which are 
zeros and could be omitted. Jerasure converts the matrix 
to a schedule which eliminates all of the matrix traver- 
sal and entry checking during encoding. Cleversafe’s 
poor performance relative to Luby can most likely be at- 
tributed to the Java implementation and the fact that the 
packet size is hard coded to be very small (since Clever- 
safe routinely distributes strips in units of 1K). 


Figure 10: Gigabytes XOR’d by each code in the [6,2]
tests. The number of XORs is independent of the machine
used.


Of the RS implementations, the implementation tailored
for RAID-6 (labeled “RS-Opt”) performs at a much
higher rate than the others. This is because
it does not perform general-purpose Galois Field multiplication
over w-bit words, but instead performs a machine
word’s worth of multiplication by two at a time.
Its performance is better when w ≤ 16, which is not
a limitation as w = 16 can handle a system with a total
of 64K drives. The Zfec implementation of RS coding
outperforms the others. This is due to the heavily
tuned implementation, which performs explicit loop unrolling
and hard-wires many features of GF(2^8) which
the other libraries do not. Both Zfec and Jerasure use
precomputed multiplication and division tables for GF(2^8).
For w = 16, Jerasure uses discrete logarithms, and
for w = 32, it uses a recursive table-lookup scheme. Additional
implementation options for the underlying Galois
Field arithmetic are discussed in [11].
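
Multiplying a machine word's worth of field elements by two can be illustrated with a short sketch. It assumes w = 8, a 32-bit word holding four elements, and the polynomial x^8 + x^4 + x^3 + x^2 + 1; these choices and the function names are assumptions for illustration, not the code of any library measured here.

```python
POLY_LOW = 0x1D  # low byte of x^8 + x^4 + x^3 + x^2 + 1 (assumed polynomial)

def mul2_byte(b):
    """Reference: multiply a single GF(2^8) element by two."""
    b2 = (b << 1) & 0xFF
    return b2 ^ POLY_LOW if b & 0x80 else b2

def mul2_word(word):
    """Multiply four GF(2^8) elements, packed in a 32-bit word, by two."""
    high = word & 0x80808080             # top bit of each byte
    shifted = (word << 1) & 0xFEFEFEFE   # per-byte left shift, no cross-byte carry
    fix = (high >> 7) * POLY_LOW         # 0x1D in every byte that overflowed
    return shifted ^ fix

word = 0x8001FF40
packed = mul2_word(word)
bytewise = int.from_bytes(
    bytes(mul2_byte(b) for b in word.to_bytes(4, "big")), "big")
```

Both paths give the same four products; the packed version needs a fixed handful of word operations regardless of how many elements fit in the word.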

The results on the Dell are similar to the MacBook 
with some significant differences. The first is that larger 
values of w perform worse relative to smaller values, re- 
gardless of their XOR counts. While the Minimum Den- 
sity codes eventually outperform RDP for larger w, their 
overall performance is far worse than the best performing 
RDP instance. For example, Liberation’s encoding speed 
when w = 31 is 82% of RDP’s speed when w = 6, as 
opposed to 97% on the MacBook. We suspect that the 
reason for this is the smaller L1 cache on the Dell, which 
penalizes the strip sizes of the larger w. 

The final difference between the MacBook and the 
Dell is that Jerasure’s RS performance for w = 16 is 
much worse than for w = 8. We suspect that this is be- 
cause Jerasure’s logarithm tables are not optimized for 
space, consuming 1.5 MB of memory, since there are six 
tables of 256 KB each [26]. The lower bound is two 128 
KB tables, which should exhibit better behavior on the 
Dell’s limited cache. 
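
The table sizes quoted above follow from simple arithmetic. The sketch below assumes that each table has one entry per element of GF(2^16) and that the 256 KB and 128 KB figures correspond to 4-byte and 2-byte entries respectively; the entry widths are an inference, not something the text states.

```python
ENTRIES = 1 << 16  # one entry per element of GF(2^16)

def table_kb(entry_bytes):
    """Size of one table in KB (1 KB = 1024 bytes) for a given entry width."""
    return ENTRIES * entry_bytes // 1024

wide_kb = table_kb(4)      # assumed 4-byte entries
narrow_kb = table_kb(2)    # assumed 2-byte entries

current_mb = 6 * wide_kb / 1024   # six wide tables
lower_bound_kb = 2 * narrow_kb    # two narrow tables
print(wide_kb, narrow_kb, current_mb, lower_bound_kb)
```

Under these assumptions the footprint drops from 1.5 MB to 256 KB, a factor of six.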

Figure 11: Encoding performance for [14,2].


Figure 11 displays the results for [14,2] (we omit
Cleversafe since its performance is so much worse than
the others). The trends are similar to [6,2], with the exception
that on the Dell, the Minimum Density codes
perform significantly worse than RDP and EVENODD, 
even though their XOR counts follow the performance of 
the MacBook. The definition of the normalized encoding 
speed means that if a code is encoding optimally, its nor- 
malized encoding speed should match the XOR speed. In 
both machines, RDP’s [14,2] normalized encoding speed 
comes closest to the measured XOR speed, meaning that 
in implementation as in theory, this is an extremely effi- 
cient code. 

Figure 12: Encoding performance for [12,4].


Figure 12 displays the results for [12,4]. Since this
is no longer a RAID-6 scenario, only the RS and CRS
codes are displayed. The normalized performance of
Jerasure’s CRS coding is much worse now because the
generator matrices are denser and cannot be optimized
as they can when m = 2. As such, the codes perform
more XOR operations than when k = 14. For example,
when w = 4 Jerasure’s CRS implementation performs
17.88 XORs per coding word; optimal is 11. This
is why the normalized coding speed is much slower than
in the best RAID-6 cases. Since Luby’s code does not
optimize the generator matrix, it performs more XORs
(23.5 per word, as opposed to 17.88 for Jerasure), and as
a result is slower.

The RS codes show the same performance as in the 
other tests. In particular, Zfec’s normalized performance 
is roughly the same in all cases. For space purposes, we 
omit the [10,6] results as they show the same trends as 
the [12,4] case. The peak performer is Jerasure’s CRS, 
achieving a normalized speed of 1409 MB/sec on the 
MacBook and 869.4 MB/sec on the Dell. Zfec’s nor- 
malized encoding speeds are similar to the others: 528.4 
MB/sec on the MacBook and 380.2 MB/sec on the Dell. 


5 Decoding Performance 


To test the performance of decoding, we converted the 
encoder program to perform decoding as well. Specif- 
ically, the decoder chooses m random data drives, and 
then after each encoding iteration, it zeros the buffers for 
those drives and decodes. We only decode data drives 
for two reasons. First, it represents the hardest decoding 
case, since all of the coding information must be used. 
Second, all of the libraries except Jerasure decode only 
the data, and do not allow for individual coding strips to
be re-encoded without re-encoding all of them. While we 
could have modified those libraries to re-encode individ- 
ually, we did not feel that it was in the spirit of the evalu- 
ation. Before testing, we wrote code to double-check that 
the erased data was decoded correctly, and in all cases it 
was. 
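
The erase, decode, and verify loop described above can be sketched with a much simpler code than the ones measured here. The example uses single XOR parity (m = 1) as a stand-in, so it only shows the shape of the test harness; the function names and parameters are hypothetical.

```python
import random

def xor_parity(strips):
    """Bytewise XOR of equal-length strips."""
    out = bytearray(len(strips[0]))
    for s in strips:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

def erase_and_check(k=6, strip_size=64, iterations=10, seed=0):
    """Encode, erase one random data strip, decode, and compare."""
    rng = random.Random(seed)
    for _ in range(iterations):
        data = [bytes(rng.randrange(256) for _ in range(strip_size))
                for _ in range(k)]
        parity = xor_parity(data)            # encoding step
        victim = rng.randrange(k)            # m = 1 random data drive
        original = data[victim]
        data[victim] = bytes(strip_size)     # zero the erased buffer
        survivors = [s for i, s in enumerate(data) if i != victim]
        data[victim] = xor_parity(survivors + [parity])  # decoding step
        if data[victim] != original:         # double-check the result
            return False
    return True
```

Calling `erase_and_check()` returns `True` when every erased strip is reconstructed exactly.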

Figure 13: Decoding performance for [6,2].


We show the performance of two configurations: [6,2] 
in Figure 13 and [12,4] in Figure 14. The results are 
best viewed in comparison to Figures 9 and 12. The re- 
sults on the MacBook tend to match theory. RDP de- 
codes as it encodes, and the two sets of speeds match 
very closely. EVENODD and the Minimal Density codes 
both have slightly more complexity in decoding, which is 
reflected in the graph. As mentioned in [24], the Minimal 
Density codes benefit greatly from Code-Specific Hybrid 
Reconstruction [14], which is implemented in Jerasure. 
Without the optimization, the decoding performance of 
these codes would be unacceptable. For example, in the 
[6,2] configuration on the MacBook, the Liberation code 
for w = 31 decodes at a normalized rate of 1820 MB/sec.
Without Code-Specific Hybrid Reconstruction, the rate is 
a factor of six slower: 302.7 MB/sec. CRS coding also 
benefits from the optimization. Again, using an example 
where w = 31, normalized speed with the optimization 
is 1809 MB/sec, and without it is 261.5 MB/sec.

The RS decoders perform identically to their encoding 
counterparts with the exception of the RAID-6 optimized 
version. This is because the optimization applies only to 
encoding and defaults to standard RS decoding. Since 
the only difference between RS encoding and decoding 
is the inversion of a k x k matrix, the fact that encoding 
and decoding performance match is expected. 
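
To make the role of the k x k inversion concrete, here is a small sketch of erasure decoding over GF(2^8). It is an illustration under assumptions, not any of the libraries' code: the polynomial 0x11D, the Cauchy-style coding rows (chosen so that every k x k submatrix of the generator is invertible), and all names are hypothetical.

```python
# GF(2^8) arithmetic via log/antilog tables (polynomial 0x11D is assumed).
EXP = [0] * 510
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 510):
    EXP[i] = EXP[i - 255]

def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def inv(a):
    return EXP[255 - LOG[a]]

def mat_vec(M, v):
    """Matrix-vector product over GF(2^8); addition is XOR."""
    out = []
    for row in M:
        acc = 0
        for a, b in zip(row, v):
            acc ^= mul(a, b)
        out.append(acc)
    return out

def invert(M):
    """Gauss-Jordan inversion of a square matrix over GF(2^8)."""
    n = len(M)
    A = [list(row) + [int(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col])
        A[col], A[pivot] = A[pivot], A[col]
        scale = inv(A[col][col])
        A[col] = [mul(scale, v) for v in A[col]]
        for r in range(n):
            if r != col and A[r][col]:
                factor = A[r][col]
                A[r] = [a ^ mul(factor, b) for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]

k, m = 4, 2
identity = [[int(i == j) for j in range(k)] for i in range(k)]
coding = [[inv(i ^ (m + j)) for j in range(k)] for i in range(m)]  # Cauchy rows
generator = identity + coding

data = [0x12, 0x34, 0x56, 0x78]
codeword = mat_vec(generator, data)          # encoding: one matrix-vector product

erased = {1, 3}                              # lose two data symbols
survivors = [r for r in range(k + m) if r not in erased][:k]
B = [generator[r] for r in survivors]        # k x k matrix of surviving rows
recovered = mat_vec(invert(B), [codeword[r] for r in survivors])
```

Once the inverse is available, decoding is the same matrix-vector product as encoding, which is why the per-byte cost of the two operations is expected to match.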

On the Dell, the trends between the various codes fol- 
low the encoding tests. In particular, larger values of w 
are penalized more by the small cache. 

Figure 14: Decoding performance for [12,4].


In the [12,4] tests, the performance trends of the CRS 
codes are the same, although the decoding proceeds more 
slowly. This is more pronounced in Jerasure’s imple- 
mentation than in Luby’s, and can be explained by XORs. 
In Jerasure, the program attempts to minimize the num- 
ber of ones in the encoding matrix, without regard to the 
decoding matrix. For example, when w = 4, CRS encoding
requires 5.96 GB of XORs. In a decoding example,
it requires 14.1 GB of XORs, and with Code-Specific 
Hybrid Reconstruction, that number is reduced to 12.6. 
Luby’s implementation does not optimize the encoding 
matrix, and therefore the penalty of decoding is not as 
great. 

As with the [6,2] tests, the performance of RS decoding
remains identical to that of RS encoding.


6 XOR Units 


This section is somewhat obvious, but it does bear 
mentioning that the unit of XOR used by the encod- 
ing/decoding software should match the largest possible 
XOR unit of the machine. For example, on 32-bit ma- 
chines like the MacBook and the Dell, the long and int 
types are both four bytes, while the char and short types 
are one and two bytes, respectively. On 64-bit machines, 
the long type is eight bytes. To illustrate the dramatic 
impact of word size selection for XOR operations, we 
display RDP performance for the [6,2] configuration (w 
= 6) on the two 32-bit machines and on an HP dc7600
workstation with a 64-bit Pentium D860 processor run- 
ning at 2.8 GHz. The results in Figure 15 are expected. 

Figure 15: Effect of changing the XOR unit of RDP encoding
when w = 6 in the [6,2] configuration.


The performance penalty at each successively smaller 
word size is roughly a factor of two, since twice as many 
XORs are being performed. All the libraries tested in 
this paper perform XORs with the widest word possible. 
This also shows how well suited 64-bit machines are to
these types of operations.
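
The factor-of-two relationship can be seen by counting operations. The sketch below XORs two buffers in units of 1, 2, 4, and 8 bytes and reports how many XORs each unit size needs. It is written in Python for brevity, so it shows only the operation counts, not hardware speeds; the function name is hypothetical.

```python
def xor_region(a, b, unit):
    """XOR two equal-length buffers `unit` bytes at a time.

    Returns the result and the number of XOR operations performed.
    """
    assert len(a) == len(b) and len(a) % unit == 0
    out = bytearray(len(a))
    ops = 0
    for off in range(0, len(a), unit):
        x = int.from_bytes(a[off:off + unit], "little")
        y = int.from_bytes(b[off:off + unit], "little")
        out[off:off + unit] = (x ^ y).to_bytes(unit, "little")
        ops += 1
    return bytes(out), ops

a = bytes(range(64))
b = bytes(range(64, 128))
counts = {unit: xor_region(a, b, unit)[1] for unit in (1, 2, 4, 8)}
print(counts)
```

Each halving of the unit doubles the number of XORs needed to cover the same 64-byte region, while the result is identical.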


7 Conclusions 


Given the speeds of current disks, the libraries explored 
here perform at rates that are easily fast enough to build 
high performance, reliable storage systems. We offer the 
following lessons learned from our exploration and ex- 
perimentation: 


RAID-6: The three RAID-6 codes, plus Jerasure’s 
implementation of CRS coding for RAID-6, all perform 
much faster than the general-purpose codes. Attention 
must be paid to the selection of w for these codes: for 
RDP and EVENODD, it should be as low as possible; 
for Minimal Density codes, it should be as high as the 
caching behavior allows, and for CRS, it should be se- 
lected so that the primitive polynomial has a minimal 
number of ones. Note that w ∈ {8, 16, 32} are all bad
for CRS coding. Anvin’s optimization is a significant improvement
to generalized RS coding, but does not attain
the levels of the special-purpose codes. 


CRS vs. RS: For non-RAID-6 applications, CRS cod- 
ing performs much better than RS coding, but now w 
should be chosen to be as small as possible, and atten- 
tion should be paid to reduce the number of ones in the 
generator matrix. Additionally, a dense matrix represen- 
tation should not be used for the generator matrix while 
encoding and decoding. 


Parameter Selection: In addition to w, the packet 
sizes of the codes should be chosen to yield good cache 
behavior. To achieve an ideal packet size, experimenta- 
tion is important; although there is a balance point be- 
tween too small and too large, some packet sizes per- 
form poorly due to direct-mapped cache behavior, and 
therefore finding an ideal packet size takes more effort 
than executing a simple binary search. As reported by 
Greenan with respect to Galois Field arithmetic [11], ar- 
chitectural features and memory behavior interact in such 
a way that makes it hard to predict the optimal param- 
eters for encoding operations. In this paper, we semi- 
automate it by using the region-based search of Sec- 
tion 4.4. 


Minimizing the Cache/Memory Footprint: On 
some machines, the implementation must pay attention 
to memory and cache. For example, Jerasure’s RS im- 
plementation performs poorly on the Dell when w = 16 
because it is wasteful of memory, while on the MacBook 
its memory usage does not penalize as much. Part of 
Zfec’s better performance comes from its smaller mem- 
ory footprint. In a similar vein, we have seen improve- 
ments in the performance of the XOR codes by re- 
ordering the XOR operations to minimize cache replace- 
ments [20]. We anticipate further performance gains 
through this technique. 


Beyond RAID-6: The place where future research 
will have the biggest impact is for larger values of m. 
The RAID-6 codes are extremely successful in delivering 
higher performance than their general-purpose counter- 
parts. More research needs to be performed on special- 
purpose codes beyond RAID-6, and implementations 
need to take advantage of the special-purpose codes that 
already exist [9, 10, 17]. 


Multicore: As modern architectures shift more universally
toward multicore, it will be an additional challenge
for open source libraries to exploit the opportunities
of multiple processors on a board. As demonstrated
in this paper, attention to the processor/cache interaction 
will be paramount for high performance. 


Abstract


Fault-tolerant services typically make assumptions about the 
type and maximum number of faults that they can tolerate 
while providing their correctness guarantees; when such a fault 
threshold is violated, correctness is lost. We revisit the notion 
of fault thresholds in the context of long-term archival storage. 
We observe that fault thresholds are inevitably violated in long- 
term services, making traditional fault tolerance inapplicable 
to the long-term. In this work, we undertake a “reallocation of 
the fault-tolerance budget” of a long-term service. We split the 
service into service pieces, each of which can tolerate a dif- 
ferent number of faults without failing (and without causing 
the whole service to fail): each piece can be either in a critical 
trusted fault tier, which must never fail, or an untrusted fault 
tier, which can fail massively and often, or other fault tiers in 
between. By carefully engineering the split of a long-term ser- 
vice into pieces that must obey distinct fault thresholds, we can 
prolong its inevitable demise. We demonstrate this approach 
with Bonafide, a long-term key-value store that, unlike all simi- 
lar systems proposed in the literature, maintains integrity in the 
face of Byzantine faults without requiring self-certified data. 
We describe the notion of tiered fault tolerance, the design, im- 
plementation, and experimental evaluation of Bonafide, and ar- 
gue that our approach is a practical yet significant improvement 
over the state of the art for long-term services. 


1 Introduction 


Current fault-tolerant replicated service designs are often 
unsuitable for long-term applications, such as archival 
storage for digital artifacts, which is gaining importance 
for business [42], regulatory [5,6], and cultural [36] rea- 
sons. This unsuitability results from the typical fault as- 
sumptions on which the correctness of such systems is 
conditioned. For example, in typical Byzantine-fault tol- 
erant (BFT) replicated systems [13], it is assumed that 
the number of faulty replicas is always less than some 
fixed threshold such as 1/3 of the replica population. 

In typical, “near-term” applications, such a uniform- 
threshold-based fault assumption can be reasonable and 
achievable. For example, one can argue that in a well- 
maintained population of diverse, high-assurance replica 
servers, by the time a third of the population is broken 
into or just grows faulty, the operators of faulty repli- 
cas can repair them. Thus, the repair reduces the number 
of faulty replicas, averts a threshold breach, and thereby 
keeps the system’s fault assumption inviolate. 


Scott Shenker, John Kubiatowicz 

Unfortunately, this reasoning falls apart for applications
and deployments with a long-term horizon, say
many decades. Whereas a population of replica servers 
can be plausibly “well-maintained” enough for a few 
years, it is difficult to protect perfectly from momen- 
tary threshold breaches over the long haul. Even im- 
probable correlated faults become probable given enough 
time [10]. Once that threshold is breached, for however 
brief a period, the system’s fault assumption is violated, 
and correctness can no longer be guaranteed at any point 
in time thereafter (Section 2.1). 

In this work, we focus on storage applications with a 
long-term horizon and design a replicated service model 
that suits them. We observe that the reliability of a sys- 
tem’s components over long spans of time can vary dra- 
matically. First, a complex but formally unverified soft- 
ware artifact might be likely to exhibit vulnerabilities; 
all that stands between it and a bug is a lapse in the judg- 
ment of a human programmer. However, a formally ver- 
ified software artifact might take much longer to exhibit 
vulnerabilities: it will not exhibit bugs against which it 
was verified, but perhaps the assumptions under which 
its correctness was verified might cease to hold upon a 
radical technology change (think of the transition from 
uniprocessors to multiprocessors as such a change). Tak- 
ing this rationale to its extreme, a trusted third party—a 
“component” in a distributed service, such as a root DNS 
server—might take even longer to fail: for example, even 
if all involved hardware and software components are op- 
erating as specified, the trusted component can fail if the 
organization operating it sells out to criminals. We argue 
that whereas this differentiation might be esoteric and 
moot for near-term services, it may be an unavoidable 
consideration for long-term applications. 

This observation leads us to a tiered fault frame- 
work for such replicated applications (Section 2.2). This 
framework partitions system components into different 
classes; for instance, software and hardware used for 
write operations is in a different class from software 
and hardware used for read-only operations. The frame- 
work treats components of one class across all nodes 
separately from components from another class, assign- 
ing a separate fault tier to each class. Like more tra- 
ditional models, the fault assumption within each tier 
is threshold-based, but the actual threshold differs from 
tier to tier. For instance, the fault assumption for the 
population of write-operation components may be a 1/3 
threshold as with typical BFT systems, whereas the fault
threshold for the population of read-operation compo- 
nents may be higher. There is no magic in this formu- 
lation: each fault tier is itself subject to a fault threshold. 
However, this multi-tier approach enables us to struc- 
ture a system so as to operate longer without violating 
its overall fault assumptions. 

One could informally view this tiered fault frame- 
work as a “reallocation of the dependability budget” 
across the different hardware and software components 
and across time. This stronger fault framework implies 
different operational practices for each component class: 
high-trust components must be formally verified before 
deployment—which might imply that they be limited in 
functionality or that they run infrequently and briefly, 
and are mostly off-line to reduce their attack surface— 
whereas lower-trust components might be larger, bug- 
gier, and running continuously. 

To make things concrete, we study a particular kind 
of long-term application: an authentic long-term key- 
value store. Such a facility can be useful, for instance, 
as a directory for finding sensitive data given a human- 
memorable name. One example is a directory for self- 
certifying names of stored files given a file’s human- 
friendly name. Such a service can close the loop for 
previously proposed reliable long-term archival services 
such as Glacier [25], OceanStore [31], Pergamum [49],
and Preservation DataStores [20], which can withstand 
Byzantine attacks only as long as a client holds a self- 
certifying name for a data item. This leaves out of scope 
the task of finding those human-unfriendly names by 
a user in the future, not to mention that today’s self- 
certifying names (e.g., the SHA-1 hash of a document) 
will not be certifying anything in the future if the technol- 
ogy behind them is defeated (this was recently demon- 
strated as inevitable for SHA-1 hashes [19]). 

Bonafide is such a key-value store that provides its 
correctness guarantees (integrity and liveness) under a 
tiered fault model (Section 3). Bonafide partitions the op- 
eration of a key-value store into three classes of compo- 
nents bound by three tiers of threshold-based fault as- 
sumptions. The lowest, most error-prone tier of Bonafide 
is occupied by the service process, a mechanism for re- 
sponding to the clients’ read-only requests (e.g., looking 
up existing key-value bindings) and for buffering—but 
not executing—new key-value additions. The middle tier 
contains the update process, which performs in batch all 
buffered key-value additions, but runs periodically and 
only briefly. The highest tier contains a minimal trusted 
facility for a moded, attested storage module (MAS), 
which keeps the error-prone service process safe and pro- 
tects the integrity of the update process. 

Bonafide provides its guarantees as long as no com- 
ponent of the trusted top tier and fewer than a third of 
middle-tier components fail at the same time; any number
of bottom-tier components can fail. In addition, like
other systems such as Carbonite [15], Bonafide offers 
durability (that is, does not lose stored key-value bind- 
ings) as long as the system creates copies of data faster 
than they are lost. 
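
The tiered fault assumption stated above reduces to a small predicate. The following is a minimal sketch with hypothetical names; it captures only the threshold condition for integrity and liveness, not the separate durability condition or any part of Bonafide's protocol.

```python
def guarantees_hold(top_faulty, middle_faulty, middle_total, bottom_faulty=0):
    """True while the tiered fault assumption from the text is met.

    top_faulty    -- failed components in the trusted top tier (must be 0)
    middle_faulty -- failed middle-tier components (must be fewer than 1/3)
    bottom_faulty -- failed bottom-tier components (any number is tolerated,
                     so the value is deliberately ignored)
    """
    return top_faulty == 0 and 3 * middle_faulty < middle_total

print(guarantees_hold(0, 1, 4, bottom_faulty=100))  # bottom tier may fail freely
print(guarantees_hold(0, 2, 4))                     # middle-tier threshold breached
print(guarantees_hold(1, 0, 4))                     # trusted tier must never fail
```

A uniform-threshold model would collapse all three arguments into a single count, which is exactly the restriction the tiered framework relaxes.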

Our prototype implementation provides a simple 
add/get interface and shows reasonable performance 
(Section 4). We note that most building blocks for 
Bonafide are borrowed from prior work, most notably 
from trusted primitives, authenticated data structures, 
proactive recovery, and BFT replicated state machines. 
It is the structuring of Bonafide as a service observing 
a tiered fault framework for long-term operation that we 
claim as novel. 

We discuss the tradeoffs and extensions of Bonafide in 
Section 5, describe related work in Section 6, and con- 
clude in Section 7. 


2 Tiered Fault Tolerance 


In this section, we demonstrate how a uniform-fault- 
threshold system model is not suitable for long-term ap- 
plications, and we introduce a tiered fault framework 
examining its feasibility for long-term applications. Al- 
though we focus here on Byzantine faults, the approach 
applies to weaker kinds of faults as well. 


2.1 Fault Assumption Violations 


We give here an example of how a system built on Castro
and Liskov’s popular Practical Byzantine Fault Tolerance
(PBFT) [13] protocol for replicating state machines
breaks with even a transient violation of the fault threshold.
In PBFT, an upper bound f on the number of faulty replicas
among the N replicas allows the use of replica quorums
(typically of size 2f + 1) to protect the safety and liveness
of the system, but only as long as N > 3f. Figure 1 illustrates
a population of N = 4 replicas, two of which are faulty,
in violation of PBFT’s fault bound (f = 1). Furthermore,
the two non-faulty replicas temporarily cannot
communicate with each other, e.g., due to transient interference
such as DoS from the faulty replicas. Client
a sends req_a to the system. The two faulty replicas convince
one non-faulty replica to commit and execute req_a first,
since the three of them form a quorum of 3 = 2f + 1. Later, client b
sends req_b to the system. The two faulty replicas convince
the other non-faulty replica, r3, to commit and execute req_b
first, since r3 never saw req_a. Henceforth, all results that
the two non-faulty replicas send back to clients will depend
on divergent views of the system’s global history and state.


Figure 1: An example that shows the potential effects of a fault
threshold violation in PBFT. Black circles are faulty replicas
(one of them is the primary), gray circles are correct replicas,
and white circles are clients. When two clients c1 and c2 submit
different requests to the replicas at roughly the same
time, but only manage to reach one correct replica each, the two
faulty replicas can convince the two correct replicas to assign
the same sequence number to different requests.


Because only 2 (= f + 1) matching replies are required
to convince a client of a result, even if the fault
assumption is again met because one faulty replica is
repaired, the remaining faulty replica will always
be able to corroborate the first non-faulty replica’s view of
the world to some clients and r3’s view of the world to
other clients, keeping up the charade indefinitely. Crash-fault
tolerant replicated state machines based on the Paxos [32]
protocol do not deal with Byzantine faults explicitly (i.e.,
assume a Byzantine-fault threshold of 0) and they can have similar
problems if a Byzantine fault crops up rarely and briefly.
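
The quorum argument behind this example can be checked by enumeration. The sketch below is an illustration, not part of PBFT: it lists every pair of quorums of size 2f + 1 whose intersection contains only faulty replicas. The replica indices 0..3 are arbitrary labels and are not meant to match the labels in Figure 1.

```python
from itertools import combinations

def unsafe_quorum_pairs(n, faulty):
    """Pairs of quorums whose overlap consists solely of faulty replicas."""
    f = (n - 1) // 3              # largest f with n > 3f
    q = 2 * f + 1                 # quorum size
    quorums = [set(c) for c in combinations(range(n), q)]
    return [(a, b) for a, b in combinations(quorums, 2)
            if (a & b) <= faulty]

within_bound = unsafe_quorum_pairs(4, {0})      # one fault: f = 1 respected
beyond_bound = unsafe_quorum_pairs(4, {0, 1})   # two faults: bound violated
print(within_bound)
print(beyond_bound)
```

With one faulty replica, every pair of quorums shares at least one correct replica, so conflicting decisions cannot both gather a quorum. With two faulty replicas, one pair of quorums overlaps only in the faulty ones, which is the situation the example exploits.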


Though not general for all possible replicated state 
machine protocols, this illustration serves to demonstrate 
the common trend: once the fault assumption is violated 
(the same as a threshold breach in traditional BFT pro- 
tocols) the system cannot offer its correctness guarantees 
again, even if the fault assumption is later restored. 


The fault bound in the original PBFT protocol applies 
over the lifetime of the system, assuming that once a 
replica becomes faulty it does not recover. PBFT’s au- 
thors devised PBFT-PR, an enhanced protocol with some 
hardware support that attempts to repair faulty replicas. 
As a result, PBFT-PR reduces the length of the vulnerability
window of the system during which the fault bound
might be breached; even though more than f faults may 
occur during the lifetime of the system, as long as faults 
are repaired frequently enough so that no more than f 
faults are ever simultaneously present, the system main- 
tains its guarantees. 


PBFT-PR achieves this repair using proactive recov- 
ery [14]: a hardware watchdog on every replica periodi- 
cally reboots it with a fresh software installation from a 
read-only medium (e.g., a CD-ROM), flushing any run- 
time code damage caused since the last reboot. Upon re- 
boot, the protocol cleans up the service state before it 
goes back into regular operation. Now the “window of
vulnerability” is the period of time between two successive,
successful proactive recovery phases across the
replica population, which is much shorter than the life- 
time of the system. However, if the f fault bound is vi- 
olated within a vulnerability window, the protocol fails 
once again. 


2.2 Tiered Fault Framework 


We observe that the traditional fault model mostly 
presents an either-good-or-bad view of the world: a node
is either correct or faulty, and there is nothing in between.
In reality, however, different components of nodes
in a complex system exhibit different levels of fault tol- 
erance; this fact is also explored in the wormholes model 
for short-term applications [51,53] and we compare our 
approach with the wormholes model in detail in Sec- 
tion 6. In this work we argue that it may be unavoidable 
for long-term services to use a tiered fault framework, 
which exploits different levels of reliability in different 
components of the system. In this framework, different 
classes of components of the service implementation are 
assumed to keep their numbers of (Byzantine) faults un- 
der different thresholds. 


We believe that a tiered fault framework is desir- 
able because of the broad differentiation among software 
and hardware components. One source of differentia- 
tion comes from different assurance practices. Hardware 
microprocessor designs undergo extensive formal veri- 
fication before production and, though extremely com- 
plex, tend to exhibit fewer bugs and security vulnerabil- 
ities in their implementation than typical software sys- 
tems. Even in the software world, formally verified com- 
ponents can rigorously prove their correctness guaran- 
tees under specific execution models and, as a result, 
be protected from many runtime bugs and vulnerabili- 
ties [40,45]. This approach can be leveraged with some 
success and performance cost via the use of strongly 
typed languages such as Java and C#, which are touted 
as safer environments for building robust systems: they 
offer a formal guarantee that, as long as the execution 
runtime implements the language semantics correctly, no 
application will be vulnerable to some of the typical sys- 
tem plagues like buffer overflows. Similar guarantees are 
offered by language-based type-safe operating systems 
such as Singularity [27]. 


A second source of differentiation comes from care in 
the deployment of a system: tight physical access con- 
trols, proactive hardware and software replacement, re- 
sponsive system administration, well-designed firewalls 
and intrusion detection mechanisms contribute to keep- 
ing out the threats that can exploit any vulnerabilities 
present in the physical and logical interfaces of a system 
component. For example, a software component that is 
vulnerable to a particular exploit borne over SSH traffic 
can be shielded from that exploit if the firewalls between 
the Internet and that component drop all SSH packets be- 
fore they reach it [54], or if it only communicates with 
other trusted components over a private network [18,46]. 


A third source of differentiation comes from the 
rolling procurement characteristics of the software and 
hardware technologies in long-term services. Unlike typical
“near-term” systems, it is not the data that “flow”
through the hardware and software, but instead the hard- 
ware and software that “flow” through the long-term ser- 
vice data*: though the service needs to remain the same, 
hardware becomes obsolete, operating systems evolve, 
communication standards grow, and cryptographic best- 
known methods are broken and replaced by their succes- 
sors. For example, a trusted logical component assumed 
to never fail would require the expensive proactive re- 
placement of the cryptographic libraries or the trusted 
hardware platform used to implement it, as new crypt- 
analysis techniques become possible, faster hardware is 
introduced, and new processes for protecting processor 
packages from physical or electrical tampering become 
available. In contrast, a less trusted component could af- 
ford to trail the state of the art and use replication or other 
techniques to mask faults, only migrating to new soft- 
ware and hardware less frequently and potentially at a 
lower cost. 

Finally, fault differentiation can come from limited 
exposure. Many high-assurance systems such as certifi- 
cation authorities keep their sensitive components (e.g., 
their signing keys) mostly or wholly off-line, limiting at- 
tack opportunities. Services that have limited or poten- 
tially batched updates but can be mostly read-only (or in- 
deed off-line with on-line, untrusted proxies [21,30,33]), 
can be protected quite effectively in this fashion. 

Interestingly, there are non-trivial dependencies 
among all these sources of differentiation. For example, 
a proven-correct system that is operated by a trustwor- 
thy organization is strictly more reliable than the same 
system operated by an unreliable organization. As goes 
the usual secure systems’ truism, a complex system is 
as secure as its weakest link. This simple observation al- 
lows us to argue that long-term fault models can usefully 
and realistically be constructed in which different system 
components belong in different fault tiers: in each tier, a 
different fault threshold can be assumed, though the jus- 
tification for that fault threshold might imply restrictions 
on the component capabilities for each tier. For example, 
if one were to argue that a component tier is afforded a 
low fault threshold thanks to its being formally verified, 
that component cannot be too complex: formal verifica- 
tion is still an extremely expensive proposition both in 
terms of human effort and computational resources [56]. 
Similarly, a tier whose fault threshold is justified by its 
remaining mostly off-line had better correspond only to 
functionality that the service can afford to perform peri- 
odically in batches. 


3 Bonafide 


Our target application in this paper is Bonafide, a key-
value store designed to provide long-term integrity us-
ing replication in a tiered fault model. We are moti-
vated to build a long-term key-value store not only as
a case study for the system-building approaches we de- 
scribed earlier in the paper, but also because it is fun- 
damentally needed by archival storage systems such 
as Glacier [25], OceanStore [31], Pergamum [49], and 
Preservation DataStores [20]. Whereas such systems pro- 
vide durability (protection from data loss) and authen- 
ticity, they require data to be self-certified for their au- 
thenticity properties to hold: a client who needs to fetch 
a document from the archival system must have an au- 
thenticator such as a cryptographic hash of the docu- 
ment’s contents to ensure that what the service returns 
has not been modified; a client who does not have such 
an authenticator cannot obtain any authenticity guaran- 
tees from the service. We seek to create an archival ser- 
vice for providing indirection for precisely such authen- 
ticators: it can be used as a lookup service from a URL or 
a human-readable name to the random bits making up the 
authenticator, which can then be used to fetch the actual 
document from Glacier, OceanStore, or systems similar
to them. 

In the simplest case, a system like Bonafide of-
fers a minimal interface: clients invoke Bonafide’s
Add(key, value) method to store and preserve a partic-
ular key-value pair, if no such key is already being pre-
served, and the Get(key) → value method to obtain any
stored key-value pair by that key, if one exists. The ser-
vice is append-only. There is no method to remove or re-
place an existing key-value pair. We use an append-only
interface for simplicity; it is not a requirement.
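
As a rough illustration of this interface, the following minimal sketch models Add and Get on a single node. It is not the authors' implementation: the class name `AppendOnlyStore` and the choice to return the preserved value from `add` are assumptions made here for illustration, and replication, phases, and attestation are ignored.

```python
class AppendOnlyStore:
    """Toy, single-node model of the Add/Get interface (no replication)."""

    def __init__(self):
        self._bindings = {}

    def add(self, key, value):
        """Bind key to value unless key is already preserved.

        Returns the value bound to key after the call, so a caller can
        tell whether its own value or an earlier one is preserved.
        """
        return self._bindings.setdefault(key, value)

    def get(self, key):
        """Return the preserved value for key, or None if there is none."""
        return self._bindings.get(key)


store = AppendOnlyStore()
store.add("report.pdf", "hash-1")
store.add("report.pdf", "hash-2")   # ignored: the key is already bound
```

There is deliberately no method to remove or replace a binding, which mirrors the append-only choice described above.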


3.1 Design Rationale 


We apply the intuition behind the tiered fault framework 
by attempting to refactor the functionality of a service 
such as Bonafide into a more reliable fault tier for state 
changes and a less reliable fault tier for answering read- 
only state questions (i.e., Get requests). In keeping with 
the justification for distinct fault tiers, we make the re- 
liable, state-changing functionality mostly off-line, run- 
ning periodically to execute state changes in batch (for 
Add requests buffered but not executed during the mostly 
unreliable operation of the system). 

One challenge with such a high-level design, espe- 
cially when using commodity hardware, is the isolation 
between the reliable and the unreliable class of compo- 
nents. The Castro and Liskov approach punctuates a sys- 
tem’s timeline with periodic software refreshes, which 
can help bring a system whose faults are climbing to- 
wards the fault threshold back from the precipice. Unfor- 
tunately, that isolation goes away once the system has 
crossed the precipice; even if the number of faults is 
somehow reduced below the fault threshold again, data
changes before and after the fault violation cannot be iso-
lated, causing the loss of safety and liveness guarantees.

Figure 2: Operation of Bonafide. Each Bonafide replica alter-
nates between a service phase (S phase) and an update phase
(U phase).

To address this challenge, we require our third, most 
reliable class of component in its own top fault tier: 
a trusted mechanism for protecting state during execu- 
tion of the unreliable class of components. This mecha- 
nism is not only extremely critical, but cannot even be 
mostly off-line; as a result, to ensure its fault thresh- 
old is plausible, we must make it extremely simple. For 
Bonafide’s top-tier component class we use a moded, at- 
tested storage (MAS) facility, akin to the sealed storage 
mechanism provided by modern trusted platform hard-
ware [4]. This facility allows us to store very small
amounts of data reliably, allowing only the reliable
state-changing mechanism to update this storage.

A second challenge is that our middle-tier, reliable 
state-changing mechanism must somehow be able to 
mask faults experienced by its components. We use 
a Byzantine-fault tolerant replicated state machine ap- 
proach to implement this middle-tier mechanism. How- 
ever, since this middle tier is mostly off, we require that 
all of the system’s nodes execute this tier in a synchro- 
nized fashion, which implies loose clock synchronization 
and a fairly long period for the execution of this tier. We 
describe how to relax the requirement for clock synchro- 
nization and synchronized execution of the state machine 
replication in Section 5.2. 

The final challenge is figuring out how to use our very 
limited attested storage to protect a potentially large ser- 
vice state. The approach we adopt is the use of an authen- 
ticated data structure, which allows the integrity of arbi- 
trarily large, structured data to be protected by a small 
cryptographic digest. 


3.2 Overview 


Table 1: The components in Bonafide and their associated fault
tiers.

Bonafide is a replicated service running on replicas R =
{1,...,N} (N = 3f + 1). The replicas operate in alter-
nating synchronous phases of two types: a service phase
(S phase) and a subsequent state-update phase (U phase)
(Figure 2). During the i-th S phase, Get requests can
query the service state committed (i.e., fetch bindings
that were added) up to the (i − 1)-st U phase. Add re-
quests are buffered and executed after the end of the i-th
S phase, during the i-th U phase. In other words, service
state changes occur in batch only during the U phase.
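
The phase alternation can be sketched on a single replica as follows. This is a simplified model under stated assumptions: the class `PhasedReplica` and its method names are hypothetical, and agreement among replicas is omitted entirely.

```python
class PhasedReplica:
    """Toy replica that buffers Adds in an S phase and commits them in a U phase."""

    def __init__(self):
        self.committed = {}   # state visible to Get
        self.buffer = []      # Adds received during the current S phase

    def get(self, key):
        return self.committed.get(key)

    def add(self, key, value):
        # S phase: only buffer the request; nothing is committed yet.
        self.buffer.append((key, value))

    def run_update_phase(self):
        # U phase: apply the buffered Adds in batch, keeping existing bindings.
        for key, value in self.buffer:
            self.committed.setdefault(key, value)
        self.buffer.clear()


replica = PhasedReplica()
replica.add("k", "v")            # i-th S phase
before = replica.get("k")        # not committed yet
replica.run_update_phase()       # i-th U phase
after = replica.get("k")         # the (i+1)-st S phase sees the binding
```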

Bonafide consists of three component classes (trusted 
storage, state update, and service), each of which belongs 
to a fault tier with a different fault threshold. Table 1 
summarizes the fault tiers in Bonafide. The state update 
component of a replica contains the state update process, 
OS, and hardware excluding the trusted top tier, and the 
service component of a replica contains the service pro- 
cess, OS, and hardware excluding the trusted top tier. 

In addition, Bonafide has the following standard, par- 
tial synchrony assumption for liveness. In the network, 
packet drops, reorderings, and duplications can occur but 
retransmissions of a message eventually deliver it. How- 
ever, though finite upper bounds exist for message deliv- 
ery and operation execution times, those bounds are not 
known to protocol entities. This is a standard network as- 
sumption for Byzantine-fault tolerant systems and is not 
unique to Bonafide. 

Under this tiered fault assumption, Bonafide guaran- 
tees service safety, that is, integrity of returned data. 
However, to guarantee durability as well (i.e., that no 
data are lost) the system should create copies of data 
faster than they are lost, as in systems such as Car-
bonite [15]. Also, to ensure liveness (i.e., non-starvation)
S phases with at most f faulty replicas must occur
once in a while (more precisely, within a finite number
of phases at all points in time). This ensures that a client
needs to resubmit an Add request only a finite number
of times before it is eventually served by a U phase.

A Bonafide node contains a MAS as well as a buffer 
to hold Add requests temporarily and a main data 
structure that maintains committed bindings (Figure 3). 
In Bonafide, the service state—the key-value pairs—is 
maintained as a variation of a hash tree [37], which com- 
putes a cryptographic digest of the whole state from the 
leaves up, storing it at the tree root. The results of in- 
dividual state queries (i.e., key lookups in the tree) can 
be validated against that root digest; as long as the di- 
gest is kept safe from tampering, individual lookups can 
be performed by an untrusted service component with-
out risking an integrity violation. This state is replicated
at each replica in the system in untrusted storage (bottom 
tier) but its root digest (of size on the order of 1 Kbit in 
today’s hardware) is stored in each node’s MAS. Each 
replica’s MAS module lies in its most trusted fault tier: 
we assume that while in service no MAS module returns 
contents other than those that were stored at it. We use a 
MAS for the root digest of the service state, since it cryp- 
tographically protects the integrity of any answers about 
that state provided by even an untrusted component. 

The service state is updated during the U phase, which 
is invoked by a trusted watchdog in the most trusted fault 
tier. In the U phase, all buffered writes are agreed upon 
by non-faulty replicas using a state machine replication 
protocol and then reflected in the service state, replacing 
the integrity digest in each replica’s MAS. The U phase is 
in the next most trusted fault tier in Bonafide: we assume 
that no more than a third of the replicas’ update software 
can fail simultaneously, to ensure that the state machine 
replication protocol safety and liveness guarantees can 
be upheld within a single U phase. 

The service state is served to clients during the S 
phase. Responses to Get/Add requests are accepted by 
clients when f + 1 replicas return to the client the same 
result, and each result is consistent with the correspond- 
ing replica’s service state digest in its MAS module. The 
f +1 number comes from the fault bound of the update 
tier, which assumes no more than f update processes can 
be faulty in any single U phase; as a result, no more than 
f update processes can put an incorrectly updated digest 
into their own MAS. If the same response to a client re-
quest is provided by at least f + 1 untrusted service pro-
cesses, each backed by the trusted state digest in the
corresponding MAS, the client is guaranteed to be getting what at least
one correct replica provides. At worst, the client will re-
ceive no valid responses or obviously invalid responses
from the replicas and try again. Also, the service state
is audited (for latent storage faults or other bit loss) and 
repaired during the S phase. 
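
The client-side acceptance rule can be sketched as follows. This is only an illustration: the function `accept_result` and the `is_valid` callback are hypothetical names, and `is_valid` stands in for the witness and MAS attestation checks described in the text.

```python
from collections import defaultdict


def accept_result(replies, f, is_valid):
    """Return a result vouched for by f + 1 distinct replicas, else None.

    replies:  iterable of (replica_id, result) pairs
    is_valid: callback that checks a reply against the replica's attested
              digest (hypothetical; stands in for witness/attestation checks)
    """
    voters = defaultdict(set)
    for replica_id, result in replies:
        if is_valid(replica_id, result):
            voters[result].add(replica_id)   # count each replica at most once
            if len(voters[result]) >= f + 1:
                return result
    return None   # not enough matching replies: the caller tries again
```

With f = 1, two valid matching replies from distinct replicas suffice; two replies from the same replica do not.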

At the protocol level, Bonafide provides the follow- 
ing safety property. If an Add or Get operation collects 
f +1 matching replies from distinct replicas, the reply 
is correct. In other words, there is a serial schedule of 
committed bindings, and once a binding is committed, it 
is seen by clients. This is similar to the integrity guaran- 
tees of other long-term storage systems, but unlike them, 
Bonafide can take any key to “name” the sought data 
value, not only self-certified names. In this paper, we dis- 
cuss only Adds, but the safety property of Bonafide holds 
even when there are Removes or Replaces in the system 
API. 

In addition to safety, Bonafide provides the following 
liveness property. If an Add operation collects 2f + 1 ten-
tative acknowledgments from distinct replicas, the bind-
ing is guaranteed to be committed during the U phase if
there are at most f faults in the S phase during which
the operation is invoked. If there are more than f faults
in the S phase, the Add operation does not guarantee the
binding to be committed during the U phase since faulty
replicas can send fake tentative acknowledgments.

Figure 3: A Bonafide node contains the following state shown
in the middle of the figure: a MAS, a buffer to hold Add re-
quests temporarily, and an AST that maintains committed bind-
ings. The MAS stores the AST root digest, a sequence number,
and a checkpoint certificate. The left side shows the get, add,
audit/repair processes running during the S phase, and the right
side shows the update process running during the U phase. The
arrows show what state the processes access.

A fundamental limitation of Bonafide is that it is not 
resistant to common denial-of-service attacks, such as 
name squatting. As in all existing archival systems we 
are aware of, faulty clients can insert arbitrary bindings 
into Bonafide, preventing legitimate clients from using 
those bindings. 

Next we detail the service state and component func- 
tionalities of Bonafide. 


3.3 Bonafide State 


In Bonafide, the service state (the collection of key-value 
bindings) is maintained as an authenticated search tree 
(AST). An AST [11] is an incremental mechanism for 
maintaining cryptographic digests over sorted data sets 
(such as key-value pairs sorted by key), extending the 
concept of a traditional Merkle tree [37] for search. Ev- 
ery node contains a key-value pair and an authentication 
label. The label for an AST node is computed by hash- 
ing together its content and the labels of all child nodes. 
The label of the tree root is a cryptographic digest for 
the entire contents of the tree: it is collision-resistant, 
which means it is intractable to find two different data 
sets yielding the same AST digest and, as a result, it can 
serve as a commitment on the contents of the AST [39]. 
As with Merkle trees, a succinct witness (sometimes 
also called a proof) can be generated showing that a par- 
ticular key-value pair appears within an AST with a root 
label. Unlike Merkle trees, an AST can also provide a 
succinct witness that a key does not appear within it.
Witnesses have length logarithmic in the number of
key-value pairs contained within a tree.
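
To make the label and witness ideas concrete, here is a minimal sketch of an authenticated binary search tree. It is not the AST construction of [11]: the node layout, the `EMPTY` label for a missing child, the witness format, and the use of SHA-256 are all assumptions of this sketch, and the tree is not balanced, so witnesses are logarithmic only if the tree happens to be.

```python
import hashlib

EMPTY = b"\x00" * 32   # label of a missing child (an assumption of this sketch)


def h(*parts):
    """Hash length-prefixed byte strings so field boundaries are unambiguous."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(len(part).to_bytes(4, "big") + part)
    return digest.digest()


class Node:
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

    def label(self):
        # The label covers the node's content and the labels of its children.
        left = self.left.label() if self.left else EMPTY
        right = self.right.label() if self.right else EMPTY
        return h(self.key, self.value, left, right)


def witness(root, key):
    """Return the search path for key as (key, value, left_label, right_label) steps."""
    path, node = [], root
    while node is not None:
        left = node.left.label() if node.left else EMPTY
        right = node.right.label() if node.right else EMPTY
        path.append((node.key, node.value, left, right))
        if key == node.key:
            break
        node = node.left if key < node.key else node.right
    return path


def verify(root_label, key, path):
    """Return (True, value) for a valid existence witness, (True, None) for a
    valid non-existence witness, and (False, None) for an invalid witness."""
    expected = root_label
    for k, v, left, right in path:
        if h(k, v, left, right) != expected:
            return (False, None)
        if key == k:
            return (True, v)
        expected = left if key < k else right
    # The search ran off the tree: valid only if it ended at an empty child.
    return (expected == EMPTY, None)


root = Node(b"m", b"1", Node(b"c", b"2"), Node(b"t", b"3"))
root_label = root.label()
```

A verifier that trusts only `root_label` can check either kind of witness without trusting whoever produced the path.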

Each Bonafide replica maintains an AST in typical
(untrusted) storage containing its collection of key-value
pairs sorted by key, a buffer of received but not yet com-
mitted client requests for adding new key-value pairs, also
in untrusted storage, and a MAS having two slots within
its trusted hardware device (see Section 3.4). The MAS
slot with identifier q_s stores values of the form (s, r),
where s is the latest AST digest and r is an integer se-
quence number associated with a particular U phase. The
slot with identifier q_c stores a checkpoint certificate de-
scribed in Section 3.6. Finally, replicas know each other’s
public keys and hardware device public keys.


3.4 Top Tier: Trusted 


Cryptography: We assume standard cryptographic 
primitives for digital signing and hashing. These prim- 
itives belong in the top tier of the fault model (as they do 
in virtually all systems research and practice). In keeping 
with our “rolling procurement” argument for this trust 
over long periods (Section 2.2), we assume that cryp- 
tography is replaced with up-to-date technologies, algo- 
rithms, and key sizes well before its compromise is even 
feasible let alone practical (see Section 5.2 for a sketch 
of how this is accomplished). For a managed system, this 
is a reasonable and plausible approach. 

For conciseness, we denote by ⟨M⟩_i a message M that
is digitally signed by principal i. Replica i’s trusted hard-
ware device (i.e., its MAS principal) is principal i'. If M
is not signed, there is no subscript in the notation.
Trusted Hardware: Bonafide relies on the existence of 
a trusted hardware device on every replica in the sys- 
tem [8]. Today, such a device could be implemented in a 
programmable, tamper-resistant secure coprocessor such 
as IBM’s commercially available PCIXCC [8] board, but 
cheaper trusted alternatives are implementable with the 
trusted computing platforms coming from Intel’s (e.g., 
Intel TXT) and AMD’s (Presidio) pipelines. 

To help conduct periodic recovery operations, this de- 
vice contains a time source (this can be a regular, mono- 
tonic, crystal-based clock source with an upper bound on 
drift, or an external trusted time source received by the 
device such as GPS). A hardware watchdog, also con- 
tained within, uses this time source to trigger proactive 
recovery periodically, by causing the host to reboot from 
read-only media. This hardware watchdog sets a mode bit 
of the MAS associated with it. This bit is used to indicate 
that the system is in its U phase, and cannot be set in any 
fashion other than by triggering the watchdog. The mode 
bit can, however, be reset by the operating system. Typi- 
cally this is done during the U phase, while the software 
is still under the middle trust tier. During the S phase, the 
no longer trusted operating system can of course reset
this bit, but bit resets are idempotent, so this misbehavior
is ineffective. Such mode bits are sometimes called sticky 
registers. 

Finally, the device contains a MAS with a simple in-
terface. A MAS contains a mode bit and a set of storage
slots, each of which is identified by an identifier q. The
write interface to a MAS is Store(q, v), where v is a
value; this stores v to the slot with identifier q. This in-
terface allows requests only when the mode bit indicates
that a U phase is ongoing. The read interface of the MAS
is available at all times. It allows the attested, fresh retrieval
of any slot; a Lookup(q, z), where q is a slot identifier
and z is a nonce used for freshness (typically provided
by clients), returns ⟨LOOKUP, q, v, z, t, m⟩_i', where v is
the value currently occupying the slot with identifier q of
the MAS, t is the internal time in the device, m is the
current mode bit, and i' is the hardware device princi-
pal. If the slot is empty, then v = Empty in the returned
attestation. In our own recent work, we have introduced
an Attested Append-Only Memory (A2M) [16] and we
compare MAS with A2M in detail in Section 6.
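
The MAS interface can be approximated in software as below. This is a toy model under loud assumptions: the class and method names are hypothetical, and an HMAC stands in for the device's digital signature, which means the verifier here must share the device key, unlike a real public-key attestation.

```python
import hashlib
import hmac
import time

EMPTY = None   # value reported for an empty slot


class MAS:
    """Toy moded, attested storage. An HMAC stands in for the device signature."""

    def __init__(self, device_key):
        self._key = device_key
        self._slots = {}
        self.mode = False      # True only while a U phase is ongoing

    def watchdog_fire(self):
        self.mode = True       # only the trusted watchdog sets the mode bit

    def reset_mode(self):
        self.mode = False      # idempotent; harmless if repeated in an S phase

    def store(self, q, v):
        if not self.mode:
            raise PermissionError("Store is allowed only during a U phase")
        self._slots[q] = v

    def lookup(self, q, z):
        v = self._slots.get(q, EMPTY)
        body = ("LOOKUP", q, v, z, time.monotonic(), self.mode)
        tag = hmac.new(self._key, repr(body).encode(), hashlib.sha256).hexdigest()
        return body, tag


def check(device_key, body, tag, z):
    """Verify the attestation tag and that it answers the caller's nonce z."""
    expected = hmac.new(device_key, repr(body).encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag) and body[3] == z


mas = MAS(b"device-secret")
mas.watchdog_fire()                 # enter the U phase
mas.store("q_s", ("digest", 1))
mas.reset_mode()                    # back to the S phase
body, tag = mas.lookup("q_s", "nonce-42")
```

Because the nonce is part of the attested body, a replayed attestation for an older nonce fails the check.
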
Membership Management: Bonafide is intended for a 
well provisioned, low-churn node infrastructure. Since 
membership churn is low, the membership of nodes can 
be managed manually. Membership management is con- 
ceptually also trusted, in the top tier of the fault model. 
We discuss how to extend Bonafide to automate mem- 
bership management in the middle tier by refactoring it 
with MAS in Section 5.2. 


3.5 Bottom Tier: Service Process 


In the S phase, each Bonafide replica runs an add/get 
process to serve client requests, and a continuous audit 
and repair process in the background for durability. Pseu- 
docode for the service process is given in Figure 4. 

Get: When a client c invokes Get(k) to retrieve a
value of key k, its Bonafide stub (called a proxy be-
low) multicasts ⟨GET, k, z, c⟩_c messages to R, where
z is a nonce used for freshness, and waits for f + 1
valid matching ⟨REPLY, i, v, p_i, ⟨LOOKUP, q_s, (s_i, r_i), z, t, m⟩_i'⟩
messages confirming that (k, v) is within the
AST, or that (k, v) does not exist in the AST. Note that
the attestation includes the nonce the client sent to ensure
it does not accept a stale response.

A replica handles a Get by looking it up by key in 
its local AST and producing an existence/non-existence 
witness, accompanied by its latest MAS attestation. 
Add Buffering: When a client c invokes Add(k, v) to in-
sert a binding between key k and value v, the Bonafide
client proxy code multicasts ⟨ADD, k, v, z, c⟩_c to R. The
client waits for 2f + 1 tentative acknowledgments, each
of which is a ⟨TENTREPLY, k, v, z, c⟩_i message where i
is a replica identifier, from distinct replicas. If the client
does not receive the responses within a timeout, it pre-
sumes that the request has been dropped by the network,
so it retransmits the request to the replicas that did not
respond. Note that receiving 2f + 1 tentative acknowl-
edgments is only a hint that the binding is likely to be
committed. The client does not perform any operation
that depends on the binding being committed
and cannot be undone. Our liveness guarantee ensures
that the client will receive a final commitment (see be-
low) for each Add after a finite number of such
sets of 2f + 1 hints that remain uncommitted.

CLIENT.GET(key)
    // quo_RPC sends msg to R, collects matching responses on non-*
    // fields from a quorum of given size, retransmits on timeout
    ⟨REPLY, *, value, witness, *⟩ ← quo_RPC(⟨GET, key⟩, f + 1)
    return value

SERVER.GET(client, key)_i    // this is server i
    (value, witness) ← lookup_AST(key)
    att ← lookup_MAS(q_s)    // attestation
    send client a ⟨REPLY, i, value, witness, att⟩

CLIENT.ADD(key, value)
    ⟨TENTREPLY, *, key, value⟩ ←
        quo_RPC(⟨ADD, key, value⟩, 2f + 1)
    // at this point, the client holds a tentative reply
    collect REPLY messages    // in the next S phase
    if (f + 1 valid, matching replies are collected)
        return accepted (key, value)

SERVER.ADD(client, key, value)_i
    if ((key, value’) in AST), treat as a GET and return
    add (client, key, value) to Adds
    send client a ⟨TENTREPLY, i, key, value⟩

SERVER.AUDIT(ASTNode, h_ASTNode)_i
    status ← check ASTNode, h_ASTNode
    if (status invalid) repair ASTNode    // fetch from other replicas
    for each child C of ASTNode
        AUDIT(C, h_C)    // h_C is contained in the label of ASTNode

SERVER.START_SERVICE(Committed_Adds)_i
    // reply for ADDs committed in the previous U phase
    for each (key, value, client) in Committed_Adds
        send client a ⟨REPLY, i, value, witness, att⟩

Figure 4: Simplified service process pseudocode.

The client also waits asynchronously for com-
mit replies in MAS-attested messages of the form
⟨REPLY, i, v, p_i, ⟨LOOKUP, q_s, (s_i, r_i), z, t, m⟩_i'⟩ (the at-
testation is the result of a MAS Lookup). These mes-
sages are sent by replicas in the beginning of the next
S phase. A reply is valid if witness p_i verifies the ex-
istence of the key-value pair (k, v) within an AST with
digest s_i, and the attestation is correctly signed by the
sender’s MAS. As soon as the client proxy obtains f + 1
valid matching replies from distinct replicas, all confirm-
ing the addition of the same key-value pair, it accepts the
request as complete and notifies the application.

During the S phase, a replica only buffers ADDs and
sends a TENTREPLY message back for each ADD. It han-
dles the ADDs during the U phase. The replica also re-
turns the existing mapping for ADDs for already assigned
names. During the next S phase, the replica responds to
newly inserted ADDs with a REPLY message. The average
(committed) response time for ADDs of new bindings is
on the order of the S phase length.

SERVER.UPDATESTART()_i
    PBFT.Invoke(⟨BATCH, i, stable_ckpt_cert, Adds⟩_i)
    SERVER.FINALIZE()_i

SERVER.EXECUTE(batch)_i    // PBFT Execute callback
    append batch in batch_log
    on receiving the 2f + 1-st batch:
        choose the latest stable_ckpt from batch_log
        AllAdds ← the union of the Adds set from each batch
        for each (key, value, client) in AllAdds
            repair the AST path to this new key if needed
            insert key, value into AST
            insert (key, value, client) into Committed_Adds
        store_MAS(q_s, ASTRootDigest)
        multicast a UCHECKPOINT and flush batch_log

SERVER.FINALIZE()_i
    on receiving 2f + 1 matching UCHECKPOINTs:
        store_MAS(q_c, stable_ckpt_cert)
        reset the watchdog timer and the mode bit
        begin a new S phase

Figure 5: Update process pseudocode.

Audit and Repair: The audit and repair process ensures 
that all reachable AST nodes from the AST root are 
correctly stored. This process is recursive, starting with 
SERVER.AUDIT(ASTRoot, h_ASTRoot), where h_ASTRoot
is the digest of the AST root, and traversing the tree in
order, during which a tree node is fetched from storage
if still available and verified by computing its hash value
and comparing it with the hash contained in the label of
its parent node.

For every missing AST node with digest h, replica i
multicasts a ⟨REQASTNODE, i, h⟩_i request to R, waiting
for at least one ⟨RESPASTNODE, h, ASTNode⟩ response.
The response need not be signed, since the replica can 
verify its integrity thanks to the recursive hashes of the 
AST. As long as the root digest remains in the trusted 
MAS, the rest of the AST nodes are self-certifying. 
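
A minimal sketch of recursive audit and repair follows. It assumes a simplified storage model that is not taken from the paper: each replica's untrusted storage is a dictionary from label to node, a node is a `(key, value, child_labels)` tuple, and fetching from peers is a dictionary lookup rather than a network request.

```python
import hashlib


def label_of(node):
    """node = (key, value, child_labels); the label hashes content and child labels."""
    key, value, child_labels = node
    data = repr((key, value, tuple(child_labels))).encode()
    return hashlib.sha256(data).hexdigest()


def audit(store, label, peers):
    """Check the node with the given label and its descendants, repairing from peers.

    store: dict mapping label -> node (local, untrusted storage)
    peers: list of dicts with the same shape (other replicas' storage)
    Returns the number of nodes repaired. Raises LookupError if no peer
    can supply a node that matches its label.
    """
    repaired = 0
    node = store.get(label)
    if node is None or label_of(node) != label:
        for peer in peers:
            candidate = peer.get(label)
            # Responses need no signature: the hash itself certifies them.
            if candidate is not None and label_of(candidate) == label:
                store[label] = node = candidate
                repaired += 1
                break
        else:
            raise LookupError("no valid copy of node " + label)
    for child_label in node[2]:
        repaired += audit(store, child_label, peers)
    return repaired


leaf = ("a", "1", ())
root = ("m", "2", (label_of(leaf),))
root_label = label_of(root)
good = {label_of(leaf): leaf, root_label: root}

local = dict(good)
local[label_of(leaf)] = ("a", "tampered", ())   # simulate a latent storage fault
repaired = audit(local, root_label, [good])
```

Only the root label needs to come from trusted storage; every other node is checked against the label held by its parent.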


3.6 Middle Tier: Update Process 


When the trusted watchdog timer expires, the system
securely reboots into its proactive recovery software
from a read-only medium. The main responsibility of
the U phase is to commit a new set of additions into the
main service state. At the end of the U phase, the system
ensures that at least 2f + 1 replicas store the latest service
state digest in MAS (see Figure 5 for the pseudocode).
We use the PBFT protocol [14] to replicate the state 
machines of individual U phases, though any BFT state 
machine replication protocol would work. PBFT offers
a synchronous Invoke(request) method that returns a
response. A PBFT client (which is a replica’s U phase
in our use of the protocol) uses this method to submit
application requests—buffered Add requests—to repli- 
cas and eventually receives replies containing the result. 
PBFT also offers replicas an Execute(request) call- 
back, which invokes the application code that processes 
ordered client requests to be executed—the actual inser- 
tion of Added key bindings into the service state. 

All messages exchanged between replicas contain a
fresh attestation fetched from the MAS after the current
U phase began: the mode bit shown must be on, and the 
timestamp must be recent. Messages unaccompanied by 
this attestation are invalid and dropped. This is to ensure 
that update operations, including invocations of PBFT, 
are performed by nodes that have rebooted into their U 
phase. Any faults caused by such nodes are due to update 
process faults, which our middle fault tier bounds by f. 
Update Start: Each replica packages up its pending
ADDs (denoted by A) and the latest stable checkpoint
(i.e., 2f + 1 matching MAS attestations of the previous
round) (denoted by C_i) obtained from the MAS slot q_c
into a ⟨BATCH, i, C_i, A⟩_i message, which it submits to
PBFT’s Invoke.
Application State Machine: The update state machine, 
executed on BATCH requests as ordered by PBFT, stores 
BATCH requests in order, until it receives the 2f + 1-st 
of them (from distinct replicas). Subsequent batches are 
ignored to ensure liveness (there is no way for replicas 
to know how many batches they can expect beyond the 
2f + 1-st). Note that if there are at most f faults dur-
ing the S phase, there is at least one correct replica sub-
mitting every request whose client received 2f + 1 ten-
tative acknowledgments during the S phase, precluding
starvation. As long as there are such S phases with no 
more than f faults from time to time, the system makes 
progress. 
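
The counting behind this claim can be made explicit. As a worked step (the set names A and B are introduced here for illustration only), let A be the 2f + 1 replicas that sent tentative acknowledgments to the client and B the 2f + 1 replicas whose batches are executed, out of N = 3f + 1 replicas:

```latex
\[
|A \cap B| \;\ge\; |A| + |B| - N \;=\; (2f+1) + (2f+1) - (3f+1) \;=\; f + 1
\]
```

Since at most f replicas are faulty in such an S phase, at least one replica in the intersection is correct and includes the request in its batch.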

On receiving the (2f + 1)-st batch, the application
state machine picks the latest stable checkpoint, and
the union of all ADDs across all 2f + 1 batches. It or-
ders the ADDs according to a consistent order (e.g., by
h(k||v||c)), verifies the client’s signature, and inserts all
valid pairs into the AST in that order, ignoring keys
that already exist. The replica computes the new AST
digest s* for sequence number r* (= r + 1), stores
it into the q_s slot of its MAS, and multicasts to R a
⟨UCHECKPOINT, ⟨LOOKUP, q_s, (s*_i, r*_i), t, m⟩_i'⟩ message.
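
The deterministic execution step can be sketched as follows. This is an illustration under assumptions: the service state is a plain dictionary rather than an AST, `order_key` hashes a `repr` of the tuple as a stand-in for h(k||v||c), SHA-256 is an arbitrary choice, and client signature verification is omitted.

```python
import hashlib


def order_key(add):
    """Sort key that every replica computes identically for the same Add."""
    key, value, client = add
    data = repr((key, value, client)).encode()   # stand-in for h(k||v||c)
    return hashlib.sha256(data).digest()


def execute_batches(state, batches):
    """Apply the union of Adds from the batches to state (a dict), deterministically.

    batches: iterable of collections of (key, value, client) tuples
    Returns the list of Adds that were committed, in commit order.
    """
    all_adds = set()
    for batch in batches:
        all_adds.update(batch)
    committed = []
    for key, value, client in sorted(all_adds, key=order_key):
        if key not in state:          # append-only: existing keys are kept
            state[key] = value
            committed.append((key, value, client))
    return committed


batch_1 = [("k1", "a", "c1"), ("k2", "b", "c2")]
batch_2 = [("k1", "z", "c3"), ("k2", "b", "c2")]
state_x, state_y = {}, {}
execute_batches(state_x, [batch_1, batch_2])
execute_batches(state_y, [batch_2, batch_1])   # same batches, different arrival order
```

Because the order depends only on the contents of the Adds, two replicas that see the same set of batches reach the same state, even when two clients compete for the same key.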

When a replica receives 2f + 1 matching UCHECK-
POINT messages, it stores the set at the MAS slot q_c as
a new stable checkpoint certificate. If a replica’s old ser-
vice state is not the latest one, it will have to perform
state transfer, as described below.

When the replica obtains a new stable checkpoint cer- 
tificate, it resets its watchdog timer to D, which is the 
remaining time until the next U phase, and exits into its 
S phase by opening up communication with nodes other
than replicas and resetting the mode variable. In the be-
ginning of the new S phase, the replica sends REPLY mes-
sages for all newly inserted ADDs as described earlier.
State Transfer: Up-to-date replicas missing actual ser- 
vice state (e.g., because some of the AST nodes were 
corrupted) can apply the same repair process used dur- 
ing the S phase to obtain the AST nodes required for their 
batched ADDs.

Before the phase can end, the MAS of a replica
must contain the latest stable checkpoint. A slow replica
may fall behind in obtaining its checkpoint by executing
the agreed-upon write operations. However, the stable
checkpoint broadcast by those replicas that were up to
date allows a slow replica to store that state digest in
its MAS, thereby catching up with others and entering
the next S phase.
Single-agreement Optimization: The design described
requires at least 2f + 1 PBFT invocations, one per BATCH,
during every U phase. In the worst case, each invocation
requires 3 network roundtrip times, potentially increas-
ing the latency of the U phase tasks, which increases
the minimum duration of the U phase, which in turn re-
duces the availability of the system. Instead, the update
process can do preprocessing to create a PROPOSE mes-
sage containing at least 2f + 1 BATCH messages and only
submit that proposal to PBFT. This optimization dupli-
cates the functionality of the PBFT primary by introduc-
ing a leader to collect the BATCH messages. At the cost of
greater complexity, this optimization can make use of all
available BATCH messages, not just the first 2f + 1 mes-
sages generated by replicas, and also reduce the worst-
case number of roundtrip times required. Our implemen-
tation uses this optimization.

Each replica packages up its pending ADDs A and
the latest stable checkpoint C_i, obtained from the MAS
slot q_c, into a ⟨BATCH, i, C_i, A⟩_i message, filtering out
those ADDs for already assigned keys, and broadcasts the
BATCH to R. Once a leader replica (defined below) col-
lects at least 2f + 1 such messages including its own, it
packages them into a PROPOSE message, which it submits
to PBFT’s Invoke for Byzantine agreement. During the
Execute callback of PBFT, a replica ensures the PRO-
POSEd set contains at least 2f + 1 batches from distinct
replicas. If so, it runs the function to update the service
state, and the rest of the U phase is the same.

During each U phase, the leader described above is the 
replica (J = r mod N), where r is the current U phase 
round number. A leader may misbehave, either by delay- 
ing the transmission of a PROPOSE message, or by trans- 
mitting an incorrect such message. The latter case can 
be detected during the Execute PBFT callback, as de- 
scribed above. A non-faulty replica can detect the former 
case by setting a timer as soon as it multicasts its BATCH 
message, which it uneventfully stops when it encounters
its own BATCH as one of the batches included in a pro-
posal during the Execute callback; if the timer expires,
then the replica initiates a leader change. To avoid un- 
necessary leader changes due to transient slowness, the 
replica does exponential backoff for consecutive leader 
change initiations. 

Leader change is similar to proposal formation: to ini- 
tiate change, a replica multicasts a LEADERCHANGE mes- 
sage, which the next leader (J + 1 mod N) listens for. 
When that leader has collected 2f + 1 such requests, it
packages them into a single LEADERCHANGEREQUEST, 
which it submits to PBFT; execution of this request in- 
crements J, completing the leader change. Note that the 
leader role is similar but unrelated to the PBFT primary 
role; PBFT’s internal operation, including primary as- 
signment and view changes, is opaque to the U phase 
functionality. 
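
The leader rotation and backoff rules can be sketched in a few lines. The function names are hypothetical, and the doubling factor is an assumption: the text specifies exponential backoff but not its base.

```python
def leader_for(round_number, n, leader_changes=0):
    """Leader index: r mod N, advanced by one per executed leader change."""
    return (round_number + leader_changes) % n


def backoff_timeouts(base, attempts):
    """Timeouts for consecutive leader-change initiations, doubling each time."""
    return [base * (2 ** k) for k in range(attempts)]
```

For example, with N = 4 replicas, round 5 starts with leader 1, and one executed leader change moves the role to replica 2.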





3.7 Correctness 


Under the tiered fault assumption, Bonafide provides the 
integrity property. Briefly, it is sufficient to show that any 
binding committed is guaranteed to be safe in the future 
(if returned, it is the correct binding) and live (if there 
are at most f faulty replicas, as long as 2f + 1 replicas 
have acknowledged receiving the ADD request, the bind- 
ing will be added). We show this by connecting what the 
client knows (2f + 1 tentative acknowledgments) to a 
starting condition for the U phase, and from there to the 
steps of the state machine replication. We defer the de- 
tailed argument to Appendix A. 


4 Experimental Evaluation 


In this section, we present the implementation of 
Bonafide and evaluate its performance. 


4.1 Implementation 


To validate our design, we developed a prototype 
Bonafide implementation. We implemented the add/get, 
background audit and repair, and the optimized version 
of the update process of Bonafide (excluding leader elec- 
tion) in C/C++ on Fedora Core 6. The client and server 
communicate with SFS’s asynchronous implementation 
of SUN RPC [47] in the sfslite library [3]. Client-server 
communications are authenticated by signatures; we use 
NTT’s ESIGN with 2048-bit keys. 

The client uses a proxy, its Bonafide local stub code, 
to perform Add/Get operations. The server maintains a 
MAS, an AST, and a log for buffering ADDs. MAS is
implemented as a library and it uses NTT’s ESIGN with 
2048-bit keys for signatures as well. We use SHA-1 as a 
secure hash function. 

For Byzantine agreement during the U phase, we use
the PBFT library [14] ported to Fedora Core 6. PBFT
uses MACs for message authentication. During update,
every node runs an update server and a PBFT replicated
state machine. The leader’s update server creates
a PROPOSE message and invokes PBFT agreement on
the proposal to get consensus across the population on a
hash of the proposal. When consensus is achieved, every
replica fetches the proposal from the leader and validates
it against the agreed-upon hash.

(a)
Operation        Time (ms), mean (std)
Get              3.1 (0.24)
Add              1.0 (0.21)

(b)
Data loss (%)    Time (s), mean (std)
0                554.5 (54.6)
1                612.9 (30.3)
10               1147.6 (33.3)
100              3521.5 (201.6)

Table 2: (a) Get and Add time. (b) Audit and repair time.

We store an AST and a log using Berkeley DB 
4.5.20 [1] (see note 5). We use a binary AST to minimize the size
of membership witnesses [35,58]. An AST is stored as 
a Berkeley database with a BTREE format. Each AST 
node is stored as a Berkeley DB record, which contains 
a key, a value, a hash of its left child, and a hash of its 
right child. The primary key of this DB is the key, and 
the secondary key is the hash of the entire node content. 
To search for a value given a key in the AST and to in- 
sert a (key, value) binding to the AST while generating 
a membership witness, we traverse the AST using sec- 
ondary keys. 
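
The record layout and hash-based traversal can be illustrated with the sketch below. It is not the prototype's code: an in-memory dictionary stands in for the Berkeley DB secondary index, the tree is an unbalanced binary search tree, and the length-prefixed encoding fed to SHA-1 is an assumption made so that the example is self-contained.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

EMPTY = b""  # marks a missing child


@dataclass(frozen=True)
class Node:
    key: bytes
    value: bytes
    left: bytes   # hash of the left child, or EMPTY
    right: bytes  # hash of the right child, or EMPTY

    def digest(self) -> bytes:
        """SHA-1 over the entire node content (assumed encoding)."""
        h = hashlib.sha1()
        for field in (self.key, self.value, self.left, self.right):
            h.update(len(field).to_bytes(4, "big"))
            h.update(field)
        return h.digest()


Store = Dict[bytes, Node]  # stand-in for the secondary index: hash -> node


def insert(store: Store, root: bytes, key: bytes, value: bytes) -> bytes:
    """Insert a new binding and return the new root hash."""
    if root == EMPTY:
        node = Node(key, value, EMPTY, EMPTY)
    else:
        cur = store[root]
        if key == cur.key:
            raise KeyError("key is already bound")
        if key < cur.key:
            node = Node(cur.key, cur.value, insert(store, cur.left, key, value), cur.right)
        else:
            node = Node(cur.key, cur.value, cur.left, insert(store, cur.right, key, value))
    store[node.digest()] = node
    return node.digest()


def lookup(store: Store, root: bytes, key: bytes) -> Tuple[Optional[bytes], List[Node]]:
    """Follow child hashes from the root; return the value and visited path."""
    path: List[Node] = []
    h = root
    while h != EMPTY:
        node = store[h]
        path.append(node)
        if key == node.key:
            return node.value, path
        h = node.left if key < node.key else node.right
    return None, path


def verify(root: bytes, key: bytes, value: bytes, path: List[Node]) -> bool:
    """Check that the path links the root hash to a node holding (key, value)."""
    expected = root
    for node in path:
        if node.digest() != expected:
            return False
        if node.key == key:
            return node.value == value
        expected = node.left if key < node.key else node.right
    return False
```

The path returned by `lookup` plays the role of a membership witness: a verifier that trusts only the root hash can recompute each node's digest and follow the child hashes down to the binding.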


4.2 Performance 


We evaluate how the fault-tolerance improvement of 
Bonafide affects performance and availability. We ran 
our experiments with four Bonafide replica nodes and 
one client node. The nodes are outdated PCs with 
1.8GHz to 3.2GHz Pentium 4 processors, 1GB RAM and
3Com 3C905C Ethernet cards. They are connected over 
a dual speed 10/100Mbps 3Com switch. On a 1.8GHz 
machine, ESIGN signature creation and verification of 
20 bytes take on average 256µs and 194µs, respectively.
We did not opt for a more up-to-date infrastructure since, 
by its nature, our application need not be deployed on 
the bleeding edge of hardware. As we see next, even 
with this obsolete collection of servers and switch, per- 
formance is adequate. 

Our experiments initially populate server ASTs with 
one million bindings of a 128-byte key to a 20-byte SHA- 
1 hash as the value. 

ADD/GET Time: We use a simple micro-benchmark
client that sends 1000 ADD or GET requests. For ADDs,
servers store bindings to their logs and return tentative
acknowledgments. For GETs, servers search their ASTs
and return values, AST witnesses, and MAS attestations.

We measured Bonafide’s GET and ADD response
times by averaging over 1000 requests of each type
with randomly selected keys. On average, GET takes
3.1ms and ADD takes 1.0ms (Table 2(a)). GET takes
more time than ADD since it involves accesses along
an AST path, which incurs multiple disk block accesses.
There is a start-up effect in processing GET requests,
roughly 100 requests long, while Bonafide caches the top
AST levels.

Action                   Time (s), mean (std)
Reboot                   86.6 (2.1)
Proposal creation        8.0 (4.0)
Agreement                5.2 (1.0)
AST update/Checkpoint    271.1 (24.8)
Total                    370.9 (24.0)

Table 3: U phase duration with 1000 new committed bindings.

Audit and Repair Time: We measured the average time 
of a basic audit that does not perform any repair over five 
runs. The disk drive we used was an IBM 40GB IDE disk 
drive with rotation speed 7200 rpm, average seek time 
8.5ms, and buffer size 2MB. The mean audit time of the 
entire AST is 554.5 seconds and the standard deviation 
is 9.9% of the mean. 

To measure audit and repair time, we simulate random 
data loss. We delete a fraction of AST nodes randomly 
at a Bonafide replica and run an audit process. When 
the audit process finds a lost AST node, the process re- 
pairs it synchronously by fetching the AST node from 
a randomly-chosen remote replica. Table 2(b) shows the 
mean audit and repair time when a fraction of AST nodes 
are lost. Repair time grows with the amount of data loss,
because more nodes must be fetched from remote replicas.

Note that our current prototype implementation is not 
optimized. Several optimizations can improve our proto- 
type performance. For example, more intelligent layout 
of stored key-value bindings may reduce the random disk 
access [58], thus improving audit time. Also, an audit and 
repair process can collect missing AST nodes by fetching 
them in parallel while performing auditing. 

U Phase Duration: We measured the duration of the 
U phase when 1000 new bindings were committed. Ta- 
ble 3 shows the mean and standard deviation of the U 
phase duration of the leader averaged over five runs; the 
leader has the highest computation and network band- 
width overhead. Proposal creation indicates time to col- 
lect a BATCH certificate and to create a PROPOSE message. 
Agreement indicates time to run a PBFT agreement, 
and AST update/Checkpoint indicates time to update the 
AST with new bindings and to execute the remaining COMMIT
protocol. Several optimizations can improve the U
phase duration. We can reduce reboot time, for example,
by using fast boot from the LinuxBIOS project [2]. The
project claims a three-second boot time from power-on to
Linux console. Intelligent caching of AST nodes may reduce
the AST update time.

Figure 6: Bonafide availability varying the U phase duration
and period. Note that the y-axis starts at 0.95.

Availability: Finally, we analytically show that Bonafide 
availability (the ratio of service time to service time plus 
update time) is high enough for varying U phase dura- 
tion and period in Figure 6. When the update period is 9 
hours, availability is 0.998 and 0.983 for one-minute and 
nine-minute U phase durations, respectively. Availability 
decreases linearly as update duration increases. In addi- 
tion, as we perform update more frequently (i.e., update 
period decreases), availability decreases more rapidly. 
For example, when update duration is nine minutes, 
availability drops from 0.983 to 0.950 as update period 
changes from 9 hours to 3 hours. However, when we per- 
form update frequently, its duration may decrease since 
fewer additions are collected, mitigating the effects of 
unavailability. With one-minute update duration, avail- 
ability becomes 0.994 despite three-hour periods. 
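
The figures quoted above can be reproduced with a short calculation. The sketch below assumes that the update period counts the U phase itself, so that service time equals the period minus the update duration; this reading is an assumption, chosen because it matches the quoted values, and the function name is hypothetical.

```python
def availability(update_duration_min: float, update_period_min: float) -> float:
    """Service time divided by (service time + update time).

    Assumes the period includes the U phase, so service time is
    update_period_min - update_duration_min.
    """
    if not 0 <= update_duration_min <= update_period_min:
        raise ValueError("duration must lie between 0 and the period")
    service_time = update_period_min - update_duration_min
    return service_time / update_period_min


if __name__ == "__main__":
    for period_hours in (9, 3):
        for duration_min in (1, 9):
            a = availability(duration_min, period_hours * 60)
            print(f"period={period_hours}h duration={duration_min}min availability={a:.3f}")
```

Under this assumption availability equals 1 minus the ratio of duration to period, which is linear in the duration and falls faster when the period is shorter, as the text states.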


5 Discussion 


In this section, we discuss the tradeoffs between safety 
and availability and extensions of Bonafide. 


5.1 Safety, Availability, and Durability 


Our approach implies an operation model that trades off
availability for safety, first by confining state changes
to small portions of the system’s timeline, and second by
closing off client access to the system while those
state changes are incorporated. Though different applications
might fare differently with such a trade-off, we
believe that applications like Bonafide, as well as
document notarization [23] and auditing for accountability
[24,58], are appropriate practical candidates.

Bonafide also trades off durability against availability,
and the balance can be tuned through the update frequency.
Frequent updates improve durability since they reduce
the probability of N replica faults in an S phase,
but they reduce availability since the Bonafide service is
not available to clients during update.

Availability can be improved in two ways: (1) safely
removing the exclusion of service requests
during U phases and (2) increasing the frequency of U
phases without halting the service process.

First, it is possible to run the U phase at the same time
as an S phase, if the processes executing each can be
adequately isolated. For instance, a combination of
virtualization and trusted execution (e.g., Intel’s LaGrande
technology or AMD’s Presidio extensions) can ensure
that a new operating system image can be late-launched
(i.e., “booted”) in isolation from any currently running S
phase software, in a separate execution domain. While the
U phase is running, the S phase executing in a separate
domain can still handle requests for the previous snapshot
of the service. The same effect could be obtained with
perhaps less complexity by separating the U and S phases
onto different physical machines.

Second, it is possible to increase the update frequency 
without halting the service process by making update un- 
synchronized, that is, without requiring that all replicas 
enter the U phase at the same time. Without the need for 
clock synchronization among all replicas, the duration of 
U phases can be much shorter (since we no longer need 
to accommodate global bounds on clock drift across all 
nodes) and, as a result, U phases can be much more fre- 
quent. We give a sketch of an alternative design for un- 
synchronized U phases in the next section. 


5.2 Extensions 


fT-bound: The fault threshold for a single U phase in
Bonafide can be extended to a multi-phase fT-bound
model, in which the cumulative number of faults in T
consecutive U phases is bounded by fT for some fraction
f, but there can be phases in which more than a fraction
f of the replicas are faulty. Such a failure model may require
multi-phase recovery and an extension of MAS
to hold, instead of individual registers, an append-only
queue with at least T positions, akin to an A2M [16].
Early Commitment: Bonafide does not guarantee that 
a mapping for which a client collects 2f + 1 tentative 
acknowledgments in any S phase is committed during 
the following U phase when there are more than f faults 
during the S phase. By extending MAS, we can provide 
early commitment that guarantees a mapping is commit- 
ted during the following U phase. The attested storage 
needs to bound the number of the entries appended dur- 
ing a single S phase. Once this storage reaches its bound, 
it does not accept more appends until it is flushed out 
during the next U phase. 

In early commitment, during the S phase, a replica appends
ADDs to the bounded attested storage and sends
a TENTREPLY message with a MAS Lookup attestation
for each ADD. Essentially, the MAS is used as a trusted
un-erasable ADD buffer during an otherwise untrusted
S phase. As a result, unlike the protocol described earlier,
when the client collects 2f + 1 tentative ADD
acknowledgments containing buffering MAS attestations,
it can complete the request immediately, since that ADD
is guaranteed to be reflected in the next AST addition.
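
The bounded buffer behavior described for early commitment can be sketched as follows. This models only the bound and the flush; a real MAS would live in trusted hardware and return signed attestations, which this sketch omits. The class and method names are hypothetical.

```python
from typing import List, Tuple


class BufferFull(Exception):
    """Raised when the per-S-phase bound on buffered ADDs is reached."""


class BoundedAddBuffer:
    """Append-only ADD buffer that refuses appends once it reaches its bound."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries: List[Tuple[str, str]] = []

    def append(self, key: str, value: str) -> int:
        """Buffer one ADD during the S phase; return its slot index."""
        if len(self._entries) >= self.capacity:
            raise BufferFull("no more appends until the next U phase")
        self._entries.append((key, value))
        return len(self._entries) - 1

    def flush(self) -> List[Tuple[str, str]]:
        """Called during the U phase: hand over all entries and empty the buffer."""
        entries, self._entries = self._entries, []
        return entries
```

Refusing appends when full, rather than overwriting old entries, is what lets a tentative acknowledgment double as a commitment: every buffered ADD is still present when the next U phase flushes the buffer.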
Advanced Search: To focus on a tiered fault framework, 
we present a long-term key-value service with a mini- 
mal search interface. Extending the main data structure 
for advanced search is possible. For example, recent re- 
search shows a way for running generalized SQL queries 
on authenticated databases [21]. 

Caching for S phases: Bonafide can employ caching 
replicas that serve Get requests to increase throughput 
and availability. These caching nodes can serve requests 
continuously, i.e., they are available during both S and 
U phases. They use digital signatures that are created 
by Bonafide replicas to vouch for bindings whose fresh- 
ness is approximately guaranteed with timestamps. The 
caching nodes can grow and shrink dynamically depend- 
ing on workload without manual intervention. 
Unsynchronized U Phases: Our current design requires 
a synchronized execution of all U phases in the entire 
population, which requires bounds on the drift of all 
nodes’ clocks, which in turn requires a long U phase (to 
accommodate realistic clock drift bounds). As a result, U
phases are infrequent, and the update software is mostly off-line.

We are exploring an alternative design that does not 
require a synchronized execution of all U phases. At a 
high level, when a replica x in its U phase wishes to send 
a message m to another replica y that is not guaranteed to 
be in its U phase, replica x asks y’s S phase to store that 
message in an untrusted “mail box” for y’s subsequent U 
phase. When at a later time replica y enters its U phase, it 
checks its “mail box” for messages from other replicas’ 
U phases, which it uses to make progress. 

There are several challenges with this approach. First, 
the mail box is untrusted (it lives in the address space 
of the S phase and is subject to the bottom tier of the 
fault model) which means that messages stored in it may 
be lost or corrupted. Second, whereas our current, syn- 
chronized design executes an entire agreement protocol 
exchange within a single U phase, this unsynchronized 
design would have to take multiple rounds of U phases 
at all involved replicas to complete each agreement pro- 
tocol (one U phase at each replica to process all mes- 
sages in its mail box and to transmit the next set of mes- 
sages). On the other hand, given that there would be no 
need for clock synchronization, an unsynchronized de- 
sign can have more frequent but shorter U phases at each 
replica (say one-second-long U phases every few min- 
utes or so). We are currently adapting a protocol akin 
to Byzantine Disk Paxos [7], a shared-memory version 
of Byzantine Paxos that models well communication via 
unreliable mail boxes. 
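
The mail box exchange can be illustrated with the sketch below. It is not the design under exploration: the text does not say how corrupted messages are detected, so the use of per-sender HMAC keys is an assumption of this example, as are all names. Lost messages are not detected here; the agreement protocol itself would have to tolerate them.

```python
import hashlib
import hmac
from typing import Dict, List, Tuple

Entry = Tuple[int, bytes, bytes]  # (sender id, payload, authentication tag)


class MailBox:
    """Untrusted store kept by the S phase; it may drop or alter entries."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def deposit(self, sender: int, payload: bytes, tag: bytes) -> None:
        self.entries.append((sender, payload, tag))

    def drain(self) -> List[Entry]:
        entries, self.entries = self.entries, []
        return entries


def seal(key: bytes, sender: int, payload: bytes) -> bytes:
    """Tag a message in the sender's U phase (assumed shared-key MAC)."""
    return hmac.new(key, sender.to_bytes(4, "big") + payload, hashlib.sha256).digest()


def collect_valid(mailbox: MailBox, keys: Dict[int, bytes]) -> List[Tuple[int, bytes]]:
    """On entering the U phase, keep only messages whose tags verify."""
    valid: List[Tuple[int, bytes]] = []
    for sender, payload, tag in mailbox.drain():
        key = keys.get(sender)
        if key is not None and hmac.compare_digest(tag, seal(key, sender, payload)):
            valid.append((sender, payload))
    return valid
```

A replica would call `collect_valid` at the start of each U phase, process the surviving messages, and deposit its replies in the other replicas' mail boxes, which is why one agreement instance spans several U phases per replica.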

Upgrades: Bonafide does not require that the cryptographic
tools it uses (hash functions, digital signatures)
remain inviolate forever; as long as it is migrated to a
new algorithm for hashing or signing before the old algorithm
has been completely compromised, it can retain
its guarantees. Upgrades require an agreement (via the U 
phase), using a special Upgrade request, handled simi- 
larly to Add requests (buffered, then committed, then ex- 
ecuted). Upgrades can include hardware upgrades (e.g., 
the migration of one MAS device on a particular replica 
to a newer device, updating the replica membership and 
public keys to all those who receive the upgrade), soft- 
ware upgrades (e.g., the installation at a replica of a soft- 
ware module for a new cryptographic function), regular 
membership updates (switching public keys or locations 
for a replica), or algorithms in use. The latter case re- 
quires that all replicas have the new software for a new 
algorithm; the system cannot migrate from RSA signa- 
tures to a (fictional) new RSA++ algorithm until at least 
a strong quorum of replicas speak RSA++. As a result, 
an Upgrade request to algorithm RSA++ executes only 
if all replicas in the membership list already have soft- 
ware for the algorithm; otherwise, the request is replaced 
by a no-op. For hash function upgrades, in particular, the 
service state must be upgraded as well. This can be done 
gradually, a small number of AST subtrees per U-phase, 
by replacing node labels of nodes from tree leaves up 
to the root. While this upgrade takes place, some tree 
nodes will have labels computed with the old algorithm, 
and some labels with the new algorithm, but that is not a 
problem, until the old algorithm is ultimately and com- 
pletely compromised. 


6 Related Work 
6.1 BFT Systems


Byzantine-fault tolerant state machine replication has 
received much attention in the systems community 
since PBFT [13]. Systems such as PBFT-PR [14] and 
COCA [59] have employed proactive recovery to reduce 
the vulnerability window. In these systems, a node is 
periodically rejuvenated, checking and repairing service 
state. COCA, in particular, shares a similar goal with 
Bonafide, i.e., the maintenance of a mapping from names 
to authenticators, but does not account for long-term op- 
eration in its structure or assumptions. 

Researchers have proposed a few improvements to
PBFT that relax the 1/3 fault bound. Like PBFT,
BFT2F [34] provides safety and liveness with up to 
1/3 faulty replicas and then only fork* consistency (a 
weaker property than linearizability) when faults grow 
up to 2/3 of the population. Although closer in spirit 
to our work, since it acknowledges that multiple fault 
thresholds might be useful towards different guarantees, 
BFT2F requires state at the clients, which is unreason- 
able for long-term preservation services, and offers fork* 


consistency at its weakest fault threshold, which is inap- 
propriate for an archival lookup service. Finally, our own 
A2M-enabled BFT protocols [16] improve fault bounds 
by using A2M. In particular, A2M-PBFT-EA provides 
both safety and liveness with up to fewer than 1/2 faulty 
replicas. In this work we use some of the insights we 
gained in the A2M work, but focus instead on the notion 
of the tiered fault framework in a long-term service. 

A2M [16] is a trusted primitive that removes the abil- 
ity of faulty components to equivocate—tell different lies 
to different peers. Though a powerful primitive for BFT 
protocols, A2M is lacking MAS’s mode bit, which is 
critical for ensuring phase separations. In addition, A2M 
has more complicated internal structures and interfaces 
to account for linearizing requests and handling view 
changes. A2M has a set of trusted, undeniable, ordered
logs, and it attests to any entry in a log, to the last entry
in a log, or to the fact that some sequence
numbers were skipped. In contrast, MAS has a set of storage
slots with a simple write/read interface.

Recent work in Byzantine-fault tolerant storage sys- 
tems has focused on developing efficient erasure-coding 
based block storage protocols [12, 22, 26] that reduce 
storage overhead. Hendricks, Ganger, and Reiter [26] de- 
veloped the state-of-the-art BFT (m, n) erasure-coding 
block storage protocol to optimize reads and large writes. 
To tolerate f faults, the protocol requires m > f +1 out 
of n = m+2f servers. These block storage protocols do 
not differentiate components of the systems and are not 
designed for long-term operation. 


6.2 Differentiating Trust Levels 


Researchers have differentiated trust levels on system 
components, failure types, and failure thresholds. The 
wormholes model is a hybrid system model where the 
system is decomposed into payload subsystems with 
weak assumptions and wormhole subsystems with strong 
assumptions and the two communicate through worm- 
hole gateways [51,53]. Wormholes such as the Timely 
Computing Base (TCB) [52] and the Trusted Timely 
Computing Base (TTCB) [17, 18,41] provide concrete 
services such as timely execution and trusted block 
agreement to payload subsystems. TCB and TTCB are 
synchronous and fail by crashing. Similarly, we hybridize
system components using tiers with different
functionalities, but we distinguish components explicitly
for long-term operation, and we do so at a finer granularity,
with different fault thresholds, and in a more general way.
Hybrid fault models differentiate failure types on ho- 
mogeneous systems: some nodes can have benign faults 
and others can have Byzantine faults [38,50]. Further- 
more, Byzantine faults are classified into malicious sym- 
metric and malicious asymmetric faults [50]. Modified 
versions of the classic agreement algorithms can lead to 
more flexible fault tolerance guarantees [9, 29,50].

There has also been research on applying different 
fault thresholds to different sites or clusters. The multi- 
site threshold model differentiates two types of failures 
— site failures and process failures — in multi-site sys- 
tems [28]. The model uses a fault threshold for the num- 
ber of sites and a vector of fault thresholds, each of which 
is assigned to a site to account for a different process- 
failure probability depending on sites. In our tiered fault 
framework, each site is a tier. Yin et al. [57] proposed
an architecture that separates execution from agreement
by dividing functionality between two groups of replicas:
N agreement replicas and M execution replicas. This
architecture can tolerate ⌊(N - 1)/3⌋ agreement faults and
⌊(M - 1)/2⌋ execution faults, thus assigning
different thresholds to the two clusters. This partition is
done based on functionality. In our framework, each
cluster is a tier. In Bonafide, we also differentiate components
based on functionality, but do so at a finer level.


6.3 Long-term Stores 


Self-certifying bitstore systems such as Glacier [25], 
PAST [44], OceanStore [31], Carbonite [15], and Antiq- 
uity [55] have addressed durability comprehensively. Au- 
thenticity is addressed by expecting all stored data to be 
self-certifying: the name of the datum is an authenticator 
for that datum, and can be used to verify its contents (e.g., 
via a cryptographic hash). However, such systems leave 
out of scope where those authenticators come from. It is 
precisely this gap that Bonafide seeks to fill: providing a 
long-term store for non-self-certifying information. 

The LOCKSS system [36] is a digital preservation sys- 
tem not requiring an inviolable 1/3 fault bound. How- 
ever, this system is probabilistic in nature and does not 
provide hard safety or liveness guarantees as Bonafide 
does. POTSHARDS [48] is a long-term storage system 
that relies on multiple separately-managed archives. It 
uses secret splitting and stores shares into the archives to 
prevent accidental disclosure. Each object has a mapping 
between its object identifier and a hash for integrity. In 
POTSHARDS, each replica in an archive site is trusted 
or untrusted in its entirety. In comparison, in our work, 
only a small component at each replica is trusted (MAS), 
the update software can fail at up to a third of the popula- 
tion at the same time, and the service software can briefly 
fail everywhere at the same time without affecting safety. 

CATS [58] is a single-server service, run by a single
authority, that provides strong accountability for the
actions of both the server and its clients. Its approach is not to mask
faults through replicated servers, but to detect faults 
and punish actors responsible for the faults. Its auditing 
scheme catches server rollback attacks probabilistically. 
In comparison, Bonafide provides hard safety and live- 
ness guarantees under its fault assumption and considers 
replicated servers. 
7 Conclusion


Long-term services that operate reliably are hard to con- 
struct. This work represents a step towards understanding 
better system structuring for long-term services that can 
lead to safer solutions. We present a tiered fault frame- 
work that partitions system components of nodes in dif- 
ferent tiers, each enjoying a different fault threshold. We 
have designed and implemented Bonafide, a long-term 
key-value store that provides integrity under a three-tier 
Byzantine fault model. We hope that our work provides a 
framework for building more dependable long-term stor- 
age services. 
Notes


1. We use the terms fault bound and fault threshold interchangeably.

2. The window of vulnerability varies depending on system conditions.
For example, if some replicas’ state is corrupted, the window
becomes large.

3. This metaphor is usually attributed to Reagan Moore.

4. With our recent A2M-PBFT-EA protocol [16], we can improve this
fault bound from 1/3 to 1/2. We leave the details out of this paper to
keep the exposition simple.

5. We chose Berkeley DB since it was readily available, but our design
would be compatible with any block store such as Venti [43].


A Correctness Arguments 


In this appendix, we prove that Bonafide provides the in- 
tegrity property under the tiered Byzantine-fault model. 
In the proof, we denote by s(r) the S phase of round r 
and by p(r) the U phase after s(r). Without loss of generality,
consider a binding (k, v).


Lemma A.1. For every Add(k,v) request accepted by
2f + 1 replicas during an S phase in which there are no
more than f faulty replicas out of 3f + 1 total replicas,
(k, v) appears in any valid BATCH certificate.


Proof. We say an Add(k,v) request is accepted if there
are at least 2f + 1 replicas that receive the request; a client
can ensure that the request is accepted by checking authenticated
tentative ADD responses. Let Q_a denote this
set of replicas. At the start of p(r), each replica multicasts
a BATCH message to other replicas. The leader collects
2f + 1 distinct BATCH messages that form a BATCH
certificate. Let Q_b denote the set of replicas that form
this certificate. Since there are 3f + 1 replicas in total,
Q_a ∩ Q_b contains at least f + 1 replicas, and hence at
least one non-faulty replica that received the ADD request,
since in the S phase the number of faulty replicas is no
more than f. Therefore, the accepted request is contained
in the BATCH certificate.














Lemma A.2. A stable checkpoint certificate of the pre- 
vious round appears in any valid BATCH certificate. 


Proof. We show that at p(r) the BATCH certificate contains
the stable checkpoint (2f + 1 matching MAS attestations)
of p(r - 1). At p(r - 1), there are at least 2f + 1
replicas, each of which creates a stable checkpoint certificate
and puts the certificate into its MAS. Let Q_c denote
this set of replicas. Q_c and Q_b intersect in at least
f + 1 replicas, and hence Q_c ∩ Q_b includes at least one
non-faulty replica common to the two quorums. This replica
ensures that the stable checkpoint of the previous round is
included in the BATCH certificate. Therefore, the BATCH
certificate of p(r) contains the correct stable checkpoint
of p(r - 1).
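
The counting step used in both lemmas can be checked numerically. The sketch below is illustrative only and its function names are hypothetical: two sets of sizes q1 and q2 drawn from n replicas overlap in at least q1 + q2 - n members, and removing up to f faulty members leaves the guaranteed number of non-faulty replicas in the overlap.

```python
def min_quorum_overlap(n: int, q1: int, q2: int) -> int:
    """Smallest possible intersection of two subsets of an n-element set."""
    return max(0, q1 + q2 - n)


def guaranteed_nonfaulty_in_overlap(n: int, f: int, q1: int, q2: int) -> int:
    """Non-faulty replicas certain to be in the overlap, with at most f faults."""
    return max(0, min_quorum_overlap(n, q1, q2) - f)


if __name__ == "__main__":
    for f in range(1, 5):
        n, q = 3 * f + 1, 2 * f + 1
        print(f, min_quorum_overlap(n, q, q), guaranteed_nonfaulty_in_overlap(n, f, q, q))
```

For n = 3f + 1 and quorums of 2f + 1, the overlap is (2f + 1) + (2f + 1) - (3f + 1) = f + 1, so at least one member of the overlap is non-faulty.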














Theorem A.3. If a binding (k, v) is accepted at s(r) and
k is not in the AST, the binding is correctly read (or temporarily
unavailable) at all s(r') (r' > r).


Proof. From Lemmas A.1 and A.2, we know that a PROPOSE
message with a valid BATCH certificate contains a
correct stable checkpoint certificate of the previous round
and the bindings received during an S phase in which there
are no more than f faulty replicas. When the leader invokes
PBFT with the PROPOSE message, PBFT ensures that all
non-faulty replicas agree on the PROPOSE message. Each
such replica checks that k does not exist; if necessary, the
replica may perform state transfer for this validation. If
k does not exist, the replica inserts (k, v) into the AST,
computes a new AST digest, and appends it to MAS. Finally,
each replica creates a stable checkpoint certificate
by collecting 2f + 1 matching UCHECKPOINT messages
and appends the certificate to its MAS.

Now, suppose a client gets a reply certificate (f + 1
matching MAS attestations) of Get(k) at s(r + 1). The
reply certificate contains at least one up-to-date replica
since a non-faulty replica enters s(r + 1) only after creating
a stable checkpoint certificate. Therefore, a client
correctly reads value v when it queries with k at s(r + 1).

Once (k, v) is inserted into Bonafide at p(r), it is clear
that p(r + 1) carries (k, v) from p(r) correctly, by the
same argument we make for p(r - 1) and p(r) above,
since we have a correct AST digest. We can inductively
argue that the same holds for p(r + i) and p(r + i + 1) for all
i ≥ 0. Therefore, when a client gets a reply certificate for
Get(k) at any s(r + i) (i > 0), the client receives the correct
(k, v).
Abstract


Micro-recovery, or failure recovery at a fine granu- 
larity, is a promising approach to improve the re- 
covery time of software for modern storage systems. 
Instead of stalling the whole system during failure 
recovery, micro-recovery can facilitate recovery by a 
single thread while the system continues to run. A 
key challenge in performing micro-recovery is to be 
able to perform efficient and effective state restora- 
tion while accounting for dynamic dependencies be- 
tween multiple threads in a highly concurrent en- 
vironment. We present Log(Lock), a practical and 
flexible architecture for performing state restoration 
without re-architecting legacy code. We formally 
model thread dependencies based on accesses to both 
shared state and resources. The Log(Lock) execu- 
tion model tracks dependencies at runtime and cap- 
tures the failure context through the restoration level. 
We develop restoration protocols based on recovery
points and restoration levels that identify when
micro-recovery is possible and the recovery actions
that need to be performed for a given failure context.
We have implemented Log(Lock) in a real enterprise
storage controller. Our experimental evaluation
shows that Log(Lock)-enabled micro-recovery
is efficient. It imposes less than 10% overhead on normal
performance and less than 35% overhead during actual
recovery. However, the 35% performance overhead observed
during recovery lasts only six seconds and replaces
the four seconds of downtime that would result
from a system restart.


1 Introduction 


Enterprise storage systems serve as repositories for 
huge volumes of critical data and information. Un- 
availability of these systems results in losses amount- 
ing to millions of dollars per hour [1], bringing orga- 
nizations to a grinding halt. 

Most existing work in the domain of storage sys- 
tem availability addresses failures of the storage me- 
dia (such as disks) and recoverability from these fail- 
ures [2, 3, 4]. However, failures at the firmware 
layer that result in service loss remain largely un- 
addressed. At the same time, the software at 
the firmware layer of a storage system has evolved
tremendously in terms of functionality. Modern 
storage controllers are highly concurrent embedded 
systems with millions of lines of code [5, 6]. As a 
result of this complexity, recovering from controller 
failures is both difficult and expensive. 

While system availability requirements are con- 
stantly being driven higher, failure recovery time is 
increasing due to increasing system size, higher per- 
formance expectations, virtualization and consolida- 
tion. Since software failure recovery is often per- 
formed through system-wide recovery, the recovery 
process itself does not scale with system size [6, 7, 8]. 

How can failure recovery be made scalable? Par- 
titioning the system into smaller components with 
independent failure modes can reduce recovery time. 
However, it also increases management costs and 
decreases flexibility, while still being susceptible to 
sympathetic failures. On the other hand, refac- 
toring the software into smaller independent com- 
ponents in order to use techniques such as micro- 
reboots [8] or software rejuvenation [9] requires siz- 
able investments in terms of development and test- 
ing costs, unacceptable in the case of legacy sys- 
tems. An alternative approach is to be able to 
perform fine-granularity recovery or micro-recovery, 
without re-architecting the system. Under this ap- 
proach, failure recovery is targeted at a small subset 
of tasks/threads that need to undergo recovery while 
the rest of the system continues uninterrupted. 

Enabling fine grained recovery can be challenging, 
especially in legacy systems, and the following issues 
must be addressed: 


• Evaluating recovery success: What are the
failures that can effectively and efficiently be recovered
from, using micro-recovery?

• Determining recovery actions: What are
the recovery strategies and recovery actions that
must be performed in order to restore the system
from an error state to an error-free state?

• Identifying dependencies: Given the large
number of dynamic dependencies possible in a
highly concurrent system, what is the scope of
fine-granularity recovery?

• Enhancing recovery success and efficiency:
How can we enhance the system to facilitate
better recovery success and efficiency?


We address the first three questions, focusing on 
the challenges of tracking and restoring system state 
during micro-recovery, evaluating the possibility of 
recovery success and determining recovery actions. 

We make two unique contributions in terms of 
effective state restoration during micro-recovery. 
First, by analyzing the system state space, we iden- 
tify the set of events and system states that af- 
fect state restoration from the perspective of micro- 
recovery. We introduce the concepts of Restoration 
levels and Recovery points to capture failure and re- 
covery context and describe how to flexibly evalu- 
ate the possibility of recovery success. Based on the 
restoration levels and recovery points, we introduce 
Resource Recovery Protocol (RRP) and State Re- 
covery Protocol (SRP), which provide rules to guide 
state restoration. 

Our second contribution is Log(Lock), a practi- 
cal and lightweight architecture to track dependen- 
cies and perform state restoration in complex, legacy 
software systems. Log(Lock) passively logs system 
state changes to help identify dependencies between 
multiple threads in a concurrent environment. Uti- 
lizing this record of state changes and resource own- 
ership, Log(Lock) provides the developer with the 
failure context necessary to perform micro-recovery. 
Recovery points and their associated recovery han- 
dlers are specified by the developer. Log(Lock) is 
responsible for tracking dependencies and comput- 
ing restoration levels at runtime. 

We have implemented and evaluated Log(Lock) 
in a real enterprise storage controller. Our exper- 
imental evaluation shows that Log(Lock)-enabled 
micro-recovery is both efficient (<10% impact on 
performance) and effective (reduces a four second 
downtime to only a 35% performance impact last- 
ing six seconds). In summary, micro-recovery with 
Log(Lock) presents a promising approach to improv- 
ing storage software robustness and overall storage 
system availability. 


2 Log(Lock): Design Overview 


This section gives an overview of the Log(Lock) sys- 
tem design. We first describe the problem state- 
ment that motivates the Log(Lock) design. Using 
examples, we highlight the unique characteristics of 
micro-recovery in the context of highly concurrent 
storage controller software. Then we outline the 
technical challenges for systematic state restoration 
during micro-recovery. Finally, we briefly describe 
the system architecture of Log(Lock).


2.1 Motivation


In this section, we motivate the need for a flexible 
and lightweight state restoration architecture using 
a highly concurrent storage controller. The stor- 
age controller refers to the firmware that controls 
the cache and provides advanced functionality such 
as RAID, I/O routing, synchronization with remote 
instances and virtualization. In modern enterprise-class
storage systems, the storage controller has
evolved to become extremely complex, with millions
of lines of code that are often difficult to test. The
controller code typically executes over an N-way 
processing complex using a large number of short 
concurrent threads (~20 million/minute). While 
the software is designed to extract maximum con- 
currency and satisfy stringent performance require- 
ments, unlike database transactions it does not ad- 
here to ACID (atomicity, consistency, isolation and 
durability) properties. This software is representa- 
tive of a class expected to sustain high throughput 
and low response times continuously. 

With this architecture, when one thread encoun- 
ters an exception that causes the system to fail, the 
common way to return the system to an acceptable, 
functional state is by restarting and reinitializing the 
entire system. While the system reinitializes and 
waits for the operations to be redriven by a host, 
access to the system is lost, contributing to downtime.
As the system scales to a larger number of cores
and as the size of the in-memory structures increases,
such system-wide recovery will no longer scale [6, 8].

Many software systems, especially legacy systems, 
do not satisfy the conditions outlined as essential 
for micro-rebootable software [8]. For instance, even 
though the storage software may be reasonably mod- 
ular, component boundaries, if they exist, are loosely 
defined and components are stateful. Under these 
circumstances, the scope of a recovery action is not 
limited to a single component. 

The goal of micro-recovery is to perform recovery 
at a fine granularity such as at the thread-level, while 
determining the scope of recovery actions dynami- 
cally, based on dependencies identified at runtime. 
The key challenges in performing micro-recovery are 
identifying dependencies based on failure and recov- 
ery context, determining recovery actions and restor- 
ing the system to a consistent state after a failure. 


2.2 Examples 


We present three real examples from storage controller
software. We demonstrate how the semantics
and success of fine-grained recovery are determined 
by failure context and the interactions of threads. 

R1: /* Increment number of Users */
    lockWrite(&numActiveUsersLock);
    numActiveUsers++;
    unlockWrite(&numActiveUsersLock);

    /* Decrement number of Users */
    lockWrite(&numActiveUsersLock);
    numActiveUsers--;
    unlockWrite(&numActiveUsersLock);

R2: /* Start background tasks if no users active */
    lockRead(&numActiveUsersLock);
    if (numActiveUsers == 0) {
        Start performing background tasks.
    }
    unlockRead(&numActiveUsersLock);

Figure 1: Lost Update Conflict


R3: /* Get cache track to write to fast-write cache */
    startSCSICmd();
    processRead();
      ↳ getCacheTrack();
          getTempResource() {
              PANIC

Figure 2: Resource Ownership Conflict


Figure 1 shows two code snippets: R1 increments
the number of active users before performing work,
and in R2, a background job is triggered when there
are no active users in the system. Assume that when
a panic (user-defined or system failure/exception)
occurs during the execution of region R1, the
micro-recovery strategy is to reattempt execution
of region R1. The recovery action must ensure
clean relinquishing of resources such as the lock
numActiveUsersLock. It is also important to ensure
that the system state is consistent, since corruption of
the counter can cause the background jobs either to
never be triggered or to be triggered in the presence
of active users. In this example, while it is permissible
for other threads to read the value of the
numActiveUsers count at any time provided the
numActiveUsersLock has been released, the system must
ensure that if and only if a thread fails after performing
an increment operation on the count, a decrement
operation is performed during recovery. On
the other hand, if the failure occurred during the
execution of region R2, an idempotent background
task that is not critical, the recovery strategy may
be to just abort the current execution of the background
task. However, recovery must ensure that
the lock numActiveUsersLock has been released.
Figure 2 shows the processing of a write com- 
mand. In the event of encountering a failure, state 
restoration must ensure that temporary resources 
obtained from a shared pool are freed correctly in 
order to avoid resource leaks or starvation. It may
also require that certain cache tracks are checked
for consistency, depending upon the point of failure.
However, for a resource such as a buffer or empty
cache track obtained from a shared pool, restoring
the previous contents is not necessary.

R4: /* Update Metadata Location */
    lockWrite(&MetadataLocationLock);
    MetadataLocation = XX;
    unlockWrite(&MetadataLocationLock);

Figure 3: Dirty Read Conflict

Figure 3 shows a thread that updates a global 
variable indicating the metadata location, such as 
for checkpoint activity. In the event of a failure
caused by a failed location, the thread may have
the opportunity to modify the location without notifying
other threads or causing inconsistency, provided
no other thread has already consumed the
value. However, if another thread has consumed it,
the system may have to resort to recovery at a higher level.

These examples highlight the fact that consis- 
tency requirements for state restoration vary with 
failure context. For example, in the case of a counter 
generating unique numbers, the only requirement 
may be that modifications are monotonic. For a
shared resource, the state remains consistent as long 
as there are no resource leaks that could eventually 
lead to starvation and system unavailability. Unlike 
a transactional system, where similar problems are 
addressed, the semantics of the state and failure may 
render certain types of conflicts irrelevant from the 
perspective of system recovery. This emphasizes the 
need for a flexible state restoration architecture that 
is also lightweight and efficient, thereby allowing the 
system to sustain high performance. 


2.3 Failure Model


Our work is targeted at transient failures in the 
system, especially failures where the developer now 
uses system restart as a method to take the system 
from an unknown or faulty state to a known state. 
A number of such failure scenarios occur in stor- 
age controller software and may apply equally well 
to other software systems. We present some exam- 
ples from our analysis of storage controller failures. 
Bad input from administrator or user, insufficient 
error handling, deadlocks, a faulty communication 
channel, unhandled race conditions, boundary con- 
ditions, and timeouts are some examples of such fail- 
ures seen in storage controllers. The system restart 
mechanism is used often because the system has in- 
sufficient information, for example, when reacting 
to an asynchronous event or when dealing with an 
unknown state or receiving an unexpected stimulus.

For example, consider a failure scenario where a
write operation to disk fails because a driver from a 
third party vendor returns an unidentified error code 
due to a bug in its code. In this case, since writes are 
buffered in a fast write cache and the actual write to 
disk is performed asynchronously, dropping the re- 
quest is not an option. Another example is a config- 
uration issue that appeared early in the installation 
process that may have been fixed by trying various 
combinations of actions that were not correctly un- 
done. As a result the system finds itself in an un- 
known state that manifests as a failure after some 
period of normal operation. Such errors are difficult
to trace and, although transient, may continue
to appear every so often.

Some transient failures can be fixed through ap- 
propriate recovery actions that may range from 
dropping the current request to retrying the oper- 
ation or performing a set of actions that take the 
system to a known consistent state. Some other examples
of such transient faults that occur in storage
controller code are: (1) Unsolicited response
from an adapter: an adapter (a hardware component
not controlled by our microcode) sends a response
to a message which we did not send, or do
not remember sending; (2) Incorrect Linear Redundancy
Code (LRC): a control block has the wrong
LRC check bytes, for instance, due to an undetected
memory error; (3) Queue full condition: an adapter
refuses to accept more work due to a queue full condition.
In addition, there are other error scenarios
such as violation of service level agreements.
‘Time-out’ conditions are also common in large-scale
embedded storage systems. While the legacy system
grows along multiple dimensions, the growth is
not proportional along all dimensions. As a result,
hard-coded constant timeout values distributed in
the code base often create unexpected artificial violations.
For a more detailed classification of software
failures, please refer to [6].


2.4 Technical Challenges 


With software recovery, state restoration actions de- 
pend on the actions of the failed thread and its in- 
teractions with state and shared resources. 

Threads in the system interact in two fundamen- 
tal ways: (1) reading/writing shared data and (2) ac- 
quiring/releasing resources from/to a common pool. 
Threads also interact with the outside world through 
actions such as positioning a disk head or sending a 
response to an I/O. Often these actions cannot be 
rolled back and are referred to as outside world pro- 
cesses (OWP) [10]. In such a system, state restora- 
tion and micro-recovery must consider the sequence 
and interleaving of the actions of concurrent threads 
that gives rise to the following conflicts:

• Dirty Reads (Write-Read Conflict): Data
written by the failed thread has already been
consumed by another thread.

• Lost Updates (Write-Write Conflict):
Rolling back the failed thread may cause the updates
of other threads to be overwritten or lost.

• Unrepeatable Reads (Read-Write Conflict):
The value of the shared state variable
required by the failed thread has already been
overwritten.

• Resource Ownership: The failed thread may
continue to be in the possession of resources from
a shared pool or may be holding a lock, resulting
in resource leaks or starvation issues.


The above taxonomy is derived from that used to 
describe concurrency control concepts in transaction 
processing systems [11]. For a given failure, the set 
of recovery actions that need to be performed to re- 
turn the system to a consistent state may vary de- 
pending upon the failure and the occurrence of one 
or more of the above conflicts. Note that for appli- 
cation state, the intention is not to deterministically 
replay the events before the failure, or recover the 
application state to exactly as it was at the instant 
of failure. Rather, the goal is to restore the system 
to an error-free state. In fact, the recovery strat- 
egy may itself explicitly rely on non-determinism 
to remove transient failures. For example, Rx [12] 
demonstrates an interesting approach to recovery by 
retrying operations in a modified environment using 
checkpointed system states for rollbacks. 


Checkpointing for fault-tolerance is a well known 
technique [10, 12, 13, 14] that has also been ap- 
plied to deterministic replay for software debug- 
ging [15, 16, 17]. However, checkpointing tech- 
niques are mostly targeted at long-running applica- 
tions [10] such as scientific workloads [13], or ap- 
plications where the system can tolerate the over- 
head imposed by checkpointing [12, 14]. A number 
of unique challenges in the case of storage controller 
software make checkpointing infeasible: Unlike long- 
running applications, storage controllers have a high 
rate of short (<500 µs) concurrent threads and
are designed to support extremely high throughput 
and low response times. Given the highly concur- 
rent nature of controllers, both quiescing the system 
in order to take the checkpoint, as well as logging 
the tasks in order to re-execute work beyond the 
checkpoint is expensive in terms of time and space - 
especially since system state includes large amounts 
of metadata and cached data. Next, communication 
with OWPs such as hosts and media cannot be rolled 
back and hence invalidates checkpoints. Finally, due 
to the complexity of the code, not all failures will be 
amenable to micro-recovery, making checkpointing
too heavyweight.

Figure 4: Log(Lock) Architecture Overview

State restoration and conflict serialization are also
of interest to transactional systems [18]. Transactional
databases use schemes like strict 2-phase locking
(2PL) to guarantee conflict serializability [19].
However, such techniques can increase the length of
critical sections (i.e., the durations of locks) and are
inefficient for the highly concurrent storage controller
environment. Moreover, as the examples in Section 2.2
show, recovery actions are determined by both the
context and the semantics of a failure; a “one size fits
all” serializability requirement, while simplifying recovery
procedures, can constrain the recovery process.


2.5 System Architecture 


The Log(Lock) architecture provides support for 
state restoration during micro-recovery. To achieve 
this goal, Log(Lock) tracks resources and state de- 
pendencies relevant to a thread that has incorpo- 
rated recovery handlers for micro-recovery. 

Figure 4 presents an overview of our system ar- 
chitecture and describes the roles played by the 
Log(Lock) execution model and restoration proto- 
cols. The figure shows a system with concurrently 
executing threads where the thread depicted by a 
solid line incorporates micro-recovery mechanisms. 
In order to facilitate micro-recovery, the thread sets 
recovery points during execution, where each recov- 
ery point is associated with a recovery criterion. The 
recovery criterion specifies the conditions that must 
be satisfied by the failure context in order to use the 
recovery point as a starting point for recovery. Us- 
ing the Log(Lock) architecture, the thread (depicted 
by a solid line) enabled with micro-recovery mecha- 
nisms indicates state and resources that are relevant 
to recovery. Log(Lock) then begins logging all rele- 
vant changes and dependencies, based on the actions 
of both this thread and other concurrent threads (de- 
picted by dotted lines). 

In the event of a failure, control transfers to a 
developer-specified recovery handler. The handler
performs state restoration actions by utilizing the
resource tracking and state dependency information 
provided by the Log(Lock) execution model, in con- 
sultation with the restoration protocols. It also de- 
cides on an appropriate recovery strategy such as 
rollback, error compensation or system-level recov- 
ery. The implementation of the Log(Lock) depen- 
dency tracking component must ensure efficiency 
during normal operation while the recovery proto- 
cols ensure consistency of state restoration during 
failure recovery. Below, we summarize the four pri- 
mary design objectives of Log(Lock): 

• Incremental: Allow micro-recovery to be applied
incrementally to handle failures depending
upon effectiveness of a fine-grained approach.

• Lightweight and Non-intrusive: Minimize
impact on system performance and modifications
to legacy software functional architecture.

• Dynamic: Handle dynamic dependencies.

• Flexible: Allow application developers the flexibility
to treat different failures differently without
enforcing a “one size fits all” consistency requirement,
allowing a larger number of failures
to be handled correctly at a fine-granularity.


In the next two sections, we first describe the con- 
cepts of ‘restoration levels’ and ‘recovery points’ and 
present the restoration protocols. Then, we present 
the Log(Lock) execution model and illustrate appli- 
cation of the protocols through example scenarios. 


3 State Space Exploration 


In this section, we model failure scenarios and recov- 
ery contexts using a state space analysis approach. 
Our approach is based on the intuition that in a con- 
current system, global state and shared resources are 
often protected by locks or similar primitives. 

This section is divided into two parts. In the 
first part, we model system events, state transitions 
and interleaving of concurrent threads and demon- 
strate the discrete state space and recovery scenar- 
ios. We introduce the concepts of Restoration Level 
and Recovery Criterion, that help match a failure 
context to a recovery strategy. In the second part, 
we systematically identify the set of recovery strate- 
gies that can be applied to each failure scenario and 
present two protocols for state restoration. The Re- 
source Recovery Protocol (RRP) defines the 
steps to handle resource ownership conditions and 
the State Recovery Protocol (SRP) sets forth 
the rules to perform state restoration. 


3.1 Modeling Thread Dependencies 


Let $T = \{T_i \mid 1 \le i \le n\}$ define a system with $n$ concurrent
threads. Let $\Sigma_i(t)$ denote the sequence of
states of thread $T_i$ up to time $t$. The schedule $S(t)$
at time $t$ is the interleaving of the sequence of actions
in $\Sigma_i(t)$ for each thread $T_i$. Let $v$ denote a globally
shared structure protected by a lock. Table 1 shows
the list of valid states for a thread.

Table 1: Valid States for Thread $T_i$

Notation | Description
$T_iS$   | $T_i$ initial state
$T_iR$   | $T_i$ holds a read lock
$T_iW$   | $T_i$ holds an exclusive write lock
$T_iU$   | $T_i$ has released the lock
$T_iF$   | $T_i$ is in failed state
$T_iA$   | $T_i$ acquired a resource
$T_iRe$  | $T_i$ released a resource
$T_iE$   | $T_i$ performed an externally visible action

The system implements micro-recovery at a
thread granularity. Any failure that cannot be handled
by micro-recovery is resolved using a system-level
recovery mechanism (e.g. software reboots).

The state space for system execution consists of
all legitimate schedules $S(t)$. System states that represent
the failed state of one of the executing threads
are relevant from the perspective of micro-recovery.
To simplify the subsequent discussion, we apply the
following rules to reduce the state space:

• We consider the interactions between only two
threads $T_1$ and $T_2$.

• We only consider system states where the last
state of thread $T_1$ is $T_1F$.

• Only $T_1$ encounters a failure. Failures of thread
$T_2$ are symmetric and can be treated similarly.

• Read or write actions performed by $T_2$ before any
such actions by $T_1$ are ignored.

• We assume that the system can recover from only
a single failure. Failure during recovery results in
system-level failure recovery.

• The externally visible action is equivalent to a
‘commit action’ that cannot be rolled back.

Occurrences of the following patterns in the
schedule $S(t)$ are of interest and relevant to the selection
of a recovery strategy by thread $T_1$. Let $\rightarrow$
denote the “happened before” relation [20].

• Dirty Read (DR): $T_1W \rightarrow T_2R \rightarrow T_1F$.

• Lost Update (LU): $T_1W \rightarrow T_2W \rightarrow T_1F$.

• Unrepeatable Read (UR): $T_1R \rightarrow T_2W \rightarrow T_1F$.

• Residual Resources (RR):
$(T_1R \rightarrow T_1F) \wedge (T_1U \nrightarrow T_1F)$ or
$(T_1W \rightarrow T_1F) \wedge (T_1U \nrightarrow T_1F)$ or
$(T_1A \rightarrow T_1F) \wedge (T_1Re \nrightarrow T_1F)$.

• Committed Dependency (CD):
$T_1W \rightarrow T_2R \rightarrow T_2E \rightarrow T_1F$ or
$T_1W \rightarrow T_2W \rightarrow T_2E \rightarrow T_1F$ or
$T_1R \rightarrow T_2W \rightarrow T_2E \rightarrow T_1F$.

To determine the right strategy for recovery, it is
important to determine which of the above conflicts
have occurred and are relevant to recovery.

Restoration Level: The restoration level $R_i(t)$
of a thread $T_i$ at instant $t$ is a 5-tuple
(DR, LU, UR, RR, CD) indicating the occurrence of
dirty reads, lost updates, unrepeatable reads, residual
resources and committed dependencies in $S(t)$.

Recovery Point: A recovery point $p_i$ in thread $T_i$
represents an execution point to which control is
transferred at the end of a recovery procedure. A
default recovery point defined for all threads is the
initial system state.
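
To make the conflict patterns of this section concrete, the following is a minimal sketch, not the implementation used in the controller, of how a restoration level could be derived from a recorded schedule. The type and function names (`ll_state`, `ll_event`, `ll_level`, `restoration_level`) are hypothetical. As a simplifying assumption, residual resources are detected by tracking what $T_1$ currently holds, which also covers a lock that is released and then re-acquired before the failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* States from Table 1 that matter for conflict detection (hypothetical encoding). */
typedef enum { ST_READ, ST_WRITE, ST_UNLOCK, ST_FAIL,
               ST_ACQUIRE, ST_RELEASE, ST_EXTERNAL } ll_state;

/* One entry of the schedule S(t); thread is 1 (failing) or 2 (concurrent). */
typedef struct { int thread; ll_state state; } ll_event;

/* Restoration level: the 5-tuple (DR, LU, UR, RR, CD). */
typedef struct { bool dr, lu, ur, rr, cd; } ll_level;

/* Scan a schedule in happened-before order, stopping at the failure of T1. */
ll_level restoration_level(const ll_event *s, size_t n)
{
    ll_level lv = { false, false, false, false, false };
    bool t1_read = false, t1_wrote = false; /* T1 accessed the shared state */
    bool t1_holds_lock = false;             /* lock not yet released by T1  */
    int  t1_resources = 0;                  /* pool resources T1 still owns */

    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        ll_state st = s[i].state;
        if (s[i].thread == 1) {
            if (st == ST_FAIL) break;                       /* T1F ends the scan */
            if (st == ST_READ)  { t1_read = true;  t1_holds_lock = true; }
            if (st == ST_WRITE) { t1_wrote = true; t1_holds_lock = true; }
            if (st == ST_UNLOCK)  t1_holds_lock = false;
            if (st == ST_ACQUIRE) t1_resources++;
            if (st == ST_RELEASE && t1_resources > 0) t1_resources--;
        } else {
            if (st == ST_READ  && t1_wrote) lv.dr = true;   /* T1W -> T2R */
            if (st == ST_WRITE && t1_wrote) lv.lu = true;   /* T1W -> T2W */
            if (st == ST_WRITE && t1_read)  lv.ur = true;   /* T1R -> T2W */
            if (st == ST_EXTERNAL && (lv.dr || lv.lu || lv.ur))
                lv.cd = true;                               /* conflict, then T2E */
        }
    }
    lv.rr = t1_holds_lock || t1_resources > 0;
    return lv;
}
```

For instance, the schedule $T_1W, T_1U, T_2R, T_2U, T_1F$ yields a level in which only DR is set, while $T_1W, T_1F$ yields only RR because the write lock is still held at the point of failure.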


Recovery Criterion: Each recovery point $p_i$ is associated
with a recovery criterion $C_i$, which is a 4-tuple
(DR, LU, UR, RR) that represents the set of
criteria for dirty reads, lost updates, unrepeatable
reads and residual resources that the system state
should satisfy before recovery can be attempted using
$p_i$. For the default recovery point, all elements
of the recovery criterion are defined as “don’t care”.


CD does not figure in the recovery criterion since 
this information is used only to choose between al- 
ternate recovery strategies in the recovery handler. 
We discuss the use of CD conditions during recov- 
ery in the state recovery protocol in Section 3.2. In 
our current design, recovery points and their associated
recovery handlers are identified by developers
and are associated with an execution context. When
a thread leaves a context, the associated recovery 
points go out of scope. Within a single execution 
context, multiple recovery points may be defined, 
any of which could potentially be used during re- 
covery. Then the appropriate recovery point for the 
current failure scenario is chosen by the logic in the 
recovery handler. In the developer-specified recovery 
handler, the feasibility and correctness of restoring 
the failed system state using a recovery point, is de- 
termined using the resource and state recovery pro- 
tocols described next. Once the valid recovery points 
have been identified from the available choices, the 
selection of an appropriate recovery point and re- 
covery strategy may be a decision depending upon 
factors such as the amount of resources available for 
recovery and the time required to complete recovery. 
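
The selection step described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's implementation: each criterion element is modeled as either "don't care" or "must not have occurred", the `cost` field stands in for whatever estimate of recovery time or resources a handler might use, and all names (`rp_conflicts`, `recovery_point`, `choose_recovery_point`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Observed conflicts; CD is left out because it is not part of the criterion. */
typedef struct { bool dr, lu, ur, rr; } rp_conflicts;

typedef struct {
    int          id;
    rp_conflicts dont_care;  /* true: "don't care"; false: must not have occurred */
    unsigned     cost;       /* hypothetical estimate of recovery time */
    bool         in_scope;   /* cleared when the thread leaves the execution context */
} recovery_point;

/* Assumed meaning of "meets": every observed conflict is tolerated by the criterion. */
static bool meets(rp_conflicts level, rp_conflicts dont_care)
{
    return (!level.dr || dont_care.dr) && (!level.lu || dont_care.lu) &&
           (!level.ur || dont_care.ur) && (!level.rr || dont_care.rr);
}

/* Return the cheapest in-scope recovery point whose criterion is met, or NULL. */
const recovery_point *choose_recovery_point(const recovery_point *pts, size_t n,
                                            rp_conflicts level)
{
    const recovery_point *best = NULL;
    assert(pts != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!pts[i].in_scope || !meets(level, pts[i].dont_care))
            continue;
        if (best == NULL || pts[i].cost < best->cost)
            best = &pts[i];
    }
    return best;
}
```

A NULL result would correspond to falling back on a coarser strategy, since no recovery point in the current context is valid for the observed failure.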


3.2 Restoration Protocols 


We consider the following possible recovery strate- 
gies: (1) Rollback; (2) Roll-forward style recovery or 
error compensation; (3) System-level recovery [21]. 
Of these the rollback and error compensation strate- 
gies may be applied to the failed thread only (single- 
thread recovery) or to multiple threads including the 
failed thread (multi-thread recovery). The following
protocols are based on the assumption that committed
dependencies cannot be rolled back.


Resource Recovery Protocol (RRP): System
state can be restored to recovery point $p_i$ only if
$R_i(t)$ meets $C_i$ on the RR criterion. Otherwise,
the thread must first attempt to release or acquire
resources to meet the criterion.

The State Recovery Protocol (SRP) specifies the
recovery strategies applicable for different failure
and recovery contexts. The rationale behind the
SRP rules is that an occurrence of DR, LU or UR
events implies that an interaction with other concurrent
threads in the system has occurred. When the
restoration level does not meet the recovery criterion
and interactions with other threads have occurred,
single-thread recovery is no longer sufficient.
Next, the success of multi-thread recovery depends
on the occurrence of an externally visible action and
whether the dependency has already been committed.
Concretely, the rules of state recovery are:


State Recovery Protocol (SRP):

1. To perform single-thread recovery and restore state to
recovery point $p_i$, $R_i(t)$ should meet $C_i$ on every
element of $C_i$.

2. If $R_i(t)$ does not meet $C_i$ on the DR, LU, UR conditions
and CD occurs in $S(t)$, then only error compensation
or system-level recovery can be attempted.

3. If $R_i(t)$ does not meet $C_i$ on the DR, LU, UR conditions
and CD has not been observed in $S(t)$, then
only multi-thread rollback, error compensation or
system-level recovery is possible.
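
The RRP and SRP rules can be read as a small decision function. The sketch below is one possible encoding under stated assumptions, not the authors' code: criterion elements are booleans where true means "don't care", a return value of 0 signals that resources must first be released or acquired as RRP requires, and system-level recovery is always listed as a fallback, following the statement in Section 3.1 that failures not handled by micro-recovery are resolved at the system level. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { bool dr, lu, ur, rr, cd; } srp_level;   /* restoration level R_i(t) */
typedef struct { bool dr, lu, ur, rr; } srp_dont_care;   /* criterion C_i, true = "don't care" */

enum {
    STRAT_SINGLE_THREAD  = 1,
    STRAT_MULTI_ROLLBACK = 2,
    STRAT_ERROR_COMP     = 4,
    STRAT_SYSTEM_LEVEL   = 8
};

/* RRP: the RR element must be met before state is restored to p_i. */
bool rrp_satisfied(srp_level lv, srp_dont_care c)
{
    return !lv.rr || c.rr;
}

/* SRP: bit set of strategies permitted for a recovery point with criterion c. */
int srp_strategies(srp_level lv, srp_dont_care c)
{
    bool state_met = (!lv.dr || c.dr) && (!lv.lu || c.lu) && (!lv.ur || c.ur);

    if (!state_met) {
        if (lv.cd)   /* rule 2: a dependent thread has already committed */
            return STRAT_ERROR_COMP | STRAT_SYSTEM_LEVEL;
        /* rule 3: dependent threads can still be rolled back together */
        return STRAT_MULTI_ROLLBACK | STRAT_ERROR_COMP | STRAT_SYSTEM_LEVEL;
    }
    if (!rrp_satisfied(lv, c))
        return 0;    /* RRP: release or acquire resources first, then ask again */
    /* rule 1: every element of the criterion is met */
    return STRAT_SINGLE_THREAD | STRAT_SYSTEM_LEVEL;
}
```

Returning a set rather than a single strategy leaves the final choice to the recovery handler, which matches the flexibility objective stated in Section 2.5.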


4 Log(Lock) Execution Model


In this section, we present a concrete execution 
model of Log(Lock), that utilizes the state space 
analysis presented in the previous section. We show 
how to decide recovery strategies and how restora- 
tion levels can be tracked practically. Although the 
discussion in this paper focuses on a thread-level re- 
covery granularity, the Log(Lock) architecture can 
easily be extended to a more coarse granularity of 
micro-recovery such as at a task or component level. 

In a complex legacy system such as a storage 
controller, not all failures can be handled efficiently 
through fine-grained recovery - either because the 
failure and recovery code may be too complex, or 
system-level recovery may be a more effective recovery
technique, or simply because there may be insufficient
development and testing resources. Therefore,
our approach first involves identifying candidates
for fine-grained recovery based on the analysis
of failure logs and the software itself. The executing
instance of each candidate is known as a recoverable
thread. Recall that, for each recoverable
thread, multiple recovery points and associated recovery
criteria may be defined. In the event of a
failure, control is transferred to the recovery handler
(Section 2.5).


4.1 Tracking State Changes 


Log(Lock) is based on the intuition that all shared 
state and resources are protected by locks or similar 
synchronization primitives. Tracking lock/unlock 
calls can therefore guide the understanding of system 
state changes and provide the information required 
to identify the restoration level at the instant of fail- 
ure. At the same time, by tracking these calls on 
resources and applying the resource recovery proto- 
col, we can prevent deadlocks or resource starvation 
issues. In order to compute restoration levels and 
perform system state restoration, Log(Lock) main- 
tains the following: 


Undo Logs: Undo logs are local logs maintained by 
each recoverable thread for the following purposes: 
(1) track the sequence of state changes within a single
thread; (2) track the creation of recovery points;
and (3) track resource ownership. In general, the
Undo logs can be used to encode any information re- 
quired by a thread’s recovery handler. In our current 
implementation, Undo-logging activities and main- 
tenance of the Undo logs are left to the developer. 
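
Since Undo logging is left to the developer, its exact shape is not specified here. The following is a minimal sketch of what a per-thread, in-memory Undo log might look like, assuming integer-valued state, a fixed capacity, and a simple FNV-style running checksum. Resource ownership records are omitted for brevity, and every name (`undo_log`, `undo_log_write`, `undo_rollback_to`, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNDO_CAPACITY 32   /* hypothetical fixed capacity */

typedef enum { UNDO_STATE_CHANGE, UNDO_RECOVERY_POINT } undo_kind;

typedef struct {
    undo_kind kind;
    int *addr;        /* state change: location about to be modified */
    int  old_value;   /* state change: value before the modification */
    int  point_id;    /* recovery point: developer-chosen identifier */
} undo_record;

typedef struct {
    undo_record rec[UNDO_CAPACITY];
    size_t      len;
    uint32_t    checksum;   /* running checksum over rec[0..len) */
} undo_log;

/* Fold one record into the checksum, field by field (FNV-style mixing). */
static uint32_t fold(uint32_t h, const undo_record *r)
{
    uint32_t v[4] = { (uint32_t)r->kind, (uint32_t)(uintptr_t)r->addr,
                      (uint32_t)r->old_value, (uint32_t)r->point_id };
    for (int i = 0; i < 4; i++) h = (h ^ v[i]) * 16777619u;
    return h;
}

static uint32_t recompute(const undo_log *log)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < log->len; i++) h = fold(h, &log->rec[i]);
    return h;
}

void undo_init(undo_log *log) { log->len = 0; log->checksum = recompute(log); }

static bool append(undo_log *log, undo_record r)
{
    if (log->len == UNDO_CAPACITY) return false;
    log->rec[log->len++] = r;
    log->checksum = fold(log->checksum, &r);
    return true;
}

bool undo_set_recovery_point(undo_log *log, int point_id)
{
    undo_record r = { UNDO_RECOVERY_POINT, NULL, 0, point_id };
    return append(log, r);
}

/* Call before modifying *addr so that the old value can be restored. */
bool undo_log_write(undo_log *log, int *addr)
{
    undo_record r = { UNDO_STATE_CHANGE, addr, *addr, 0 };
    return append(log, r);
}

/* Undo state changes, newest first, back to the given recovery point. */
bool undo_rollback_to(undo_log *log, int point_id)
{
    size_t target = log->len;
    if (log->checksum != recompute(log)) return false;   /* log is corrupted */
    for (size_t i = log->len; i-- > 0; ) {
        if (log->rec[i].kind == UNDO_RECOVERY_POINT &&
            log->rec[i].point_id == point_id) { target = i; break; }
    }
    if (target == log->len) return false;                /* unknown recovery point */
    while (log->len > target + 1) {
        undo_record *r = &log->rec[--log->len];
        if (r->kind == UNDO_STATE_CHANGE) *r->addr = r->old_value;
    }
    log->checksum = recompute(log);
    return true;
}
```

A failed checksum makes the rollback refuse to run, which mirrors the idea that a lost or corrupted log leads to system-level recovery instead of micro-recovery.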


Change Track Logs: In order to track conflicts 
between concurrent threads, Log(Lock) maintains 
Change Track Logs for each lock. The Change Track 
Log is used to: (1) Track concurrent changes to 
shared structures and (2) Track commit actions. 

Both the Undo Log and Change Track Logs are 
maintained only in main memory and are verified for 
integrity using checksums. In our implementation, 
the change track log is implemented as a hashtable 
indexed using the pointer to the lock as key. Unlike 
database logs or checkpoints for state restoration, 
these logs do not need to be flushed to stable storage. 
If a failure crashes the system causing it to lose or 
corrupt the logs, then we must perform a system- 
level restart to restore the system to a consistent, 
functional state and no longer require the software’s 
state restoration logs from before the failure. 

Log(Lock) provides four basic primitives to a re- 
coverable thread: 


• startTracking(lock): Start tracking changes to
the structure protected by lock.

• stopTracking(lock): Stop tracking changes to the
structure protected by lock.

• getRestorationLevel(lock): Compute the restoration
level for the structure protected by lock.

• getResourceOwnership(lock): Get ownership information
(including lock ownership) for the structure
protected by lock.


/* Recovery Criterion for R1: No residual resources */
Owner = getResourceOwnership(&numActiveUsersLock);
/* Acquire ownership in write mode for consistent recovery */
if (Owner == ReadMode) {
    unlockRead(&numActiveUsersLock);
    lockWrite(&numActiveUsersLock);
} else if (!Owner)
    lockWrite(&numActiveUsersLock);
level = getRestorationLevel(&numActiveUsersLock);

if (level indicates dirty reads or lost updates) {
    /* Indicates write completed */
    numActiveUsers--;
} else {
    /* No other operations or write may not have completed */
    Replace old value using the Undo log;
}
unlockWrite(&numActiveUsersLock);
/* State restore complete. Jump to new execution point */
Jump to R1;

Figure 5: State Restoration Using Log(Lock)


All the above primitives are explicitly inserted 
into the code by the developer. The startTrack- 
ing call is used to trigger change tracking for shared 
state and resources protected by the lock param- 
eter. These accesses are identified by trapping 
lock/unlock calls. When the recoverable thread de- 
termines that the logs for a particular structure are 
no longer required, it explicitly issues a stopTracking
call. In the event of a failure, the system transfers
control to the designated recovery handler. The
recovery handler can utilize the getRestorationLevel 
and getResourceOwnership primitives to determine 
the current restoration level and resource ownership 
and then invoke recovery procedures appropriately. 
The restoration level is determined by examining the 
undo and change track logs. 


4.2 Recovery Using Restoration Protocols 


The goal of our state restoration approach is to re- 
turn the system to a correct, functional and known 
state by performing localized recovery and state 
restoration actions. The recovery actions are tar- 
geted at only a small subset of the threads in the sys- 
tem and a small region of the total system state that 
has been identified as affected by failure-recovery. 
Figure 5 shows pseudo code for state restoration using
the restoration protocols and the Log(Lock) architecture
for the scenario shown in Figure 1. Assume that the
recovery criterion associated with recovery point R1
specifies that resources (numActiveUsersLock) acquired
after the recovery point should be released, and that it
does not care about occurrences of DR, LU or UR events.
As shown in Figure 5, the getResourceOwnership primitive
is used to determine ownership of the numActiveUsersLock
resource. If the restoration level then indicates that
a DR or LU event has occurred, the thread must have
successfully completed incrementing numActiveUsers in
the first place. In that case, in order to roll back the
failed thread's execution correctly to recovery point R1
without losing the work done by other threads, a matching
decrement operation needs to be performed. If, however,
the change track logs indicate that no other thread has
consumed data written by the failed thread, then the
failed thread either did not complete its increment
operation or was the last thread to update the value of
numActiveUsers. In that case, the recoverable thread can
use its undo log to undo its changes, if any. The
developer of this recovery handler is expected to have
used the Undo log interfaces to store the old value prior
to modification. Once state restoration is complete,
execution is transferred to recovery point R1.
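
The value-restoration decision in this walkthrough can be sketched in a few lines of Python. This is only an illustration of the two branches described above, under stated assumptions: the function name and its parameters are hypothetical, the restoration level is reduced to a single boolean, and lock ownership handling (the getResourceOwnership step) is left out.

```python
def restore_num_active_users(current_value, consumed_by_others, undo_old_value):
    """Return the counter value to install before jumping back to R1.

    current_value:      value of numActiveUsers at the time of recovery.
    consumed_by_others: True if the restoration level reports a DR or LU
                        event, i.e. another thread has already read or
                        overwritten the failed thread's update.
    undo_old_value:     value saved in the undo log before the increment,
                        or None if nothing was logged (assumed convention).
    """
    if consumed_by_others:
        # The increment completed and other threads built on it, so the old
        # value cannot simply be restored: apply a matching decrement.
        return current_value - 1
    if undo_old_value is not None:
        # Nobody else consumed the update: undo it from the undo log.
        return undo_old_value
    # No logged modification: there is nothing to undo.
    return current_value
```

For instance, if the failed thread raised the counter from 4 to 5 and another thread then raised it to 6, the compensating decrement yields 5 and keeps the other thread's work, whereas restoring the logged value 4 would lose it.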

Similarly, in the case of the example in Figure 2, 
assume that the recovery criterion only specifies the 
constraint on releasing the temporary resource ac- 
quired after the recovery point. Therefore, the ge- 
tResourceOwnership primitive is used to obtain the 
current ownership status of the temporary resource. 
If the resource is held by the thread, in order to 
rollback to recovery point R3, the resource must be 
cleanly relinquished. The pseudo code for this exam- 
ple and the next is not shown due to lack of space. 

In the case of the failure scenario shown in Fig- 
ure 3, the recovery criterion for recovery point R4 
would be that no resources acquired after the re- 
covery point (such as lock MetadataLocationLock) 
should be held by the thread and that no DR or LU 
events should have occurred. If the restoration level 
indicates that no other thread has already consumed 
this value (i.e., no DR or LU events have occurred), 
then the changes of the failed thread can be undone 
safely by replacing with the values in the Undo log. 
However, if the value is likely to have been consumed 
by another thread (i.e. DR or LU occurred), then 
the restoration level does not meet the recovery cri- 
terion for R4. So, in accordance with SRP, the error 
cannot be handled using single-thread recovery.
Depending upon the support for multi-thread recovery,
and provided the CD event has not occurred, recovery
may require rollbacks of multiple threads. If, however,
CD has occurred, then system-level recovery or
error compensation is performed.
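
The escalation order described in this paragraph can be summarized as a small decision function. The sketch below is an interpretation, not the protocol definition: the function name, the string labels, and the representation of a restoration level as a set of event names are assumptions, as is the fallback to system-level recovery when multi-thread recovery is not supported.

```python
def choose_recovery(observed_events, forbidden_events, multi_thread_supported):
    """Pick a recovery strategy from the observed restoration level.

    observed_events:  events reported for the tracked state,
                      e.g. {"DR"} or {"LU", "CD"}.
    forbidden_events: events the recovery criterion does not tolerate,
                      e.g. {"DR", "LU"} for recovery point R4.
    """
    observed = set(observed_events)
    if not (observed & set(forbidden_events)):
        # The restoration level meets the recovery criterion.
        return "single-thread"
    if "CD" in observed:
        # Rolling back threads is no longer an option.
        return "system-level"
    if multi_thread_supported:
        return "multi-thread"
    # Assumed fallback when multi-thread recovery is unavailable.
    return "system-level"
```

With the R4 criterion, `choose_recovery({"DR"}, {"DR", "LU"}, True)` selects multi-thread recovery, while the same call with `{"DR", "CD"}` escalates to system-level recovery.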


4.3 Implementation Details


Undo logs go out of scope, i.e., can be purged, when
a recoverable thread completes execution. Similarly,
change track logs for a lock are purged when the
recoverable thread issues a stopTracking call. However,
unlike undo logs, change track logs cannot always be
purged immediately, since these centralized logs may
be shared by multiple recoverable threads. In that
case, the log entries corresponding to the purging
thread are only marked for purging and are actually
purged when the last recoverable thread using the
log issues a stopTracking call on that lock.
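
A minimal sketch of this mark-then-purge behavior follows. It assumes a per-lock log object with hypothetical method names modeled on the startTracking and stopTracking primitives; concurrency control around the log itself is omitted for brevity.

```python
class SharedChangeTrackLog:
    """Change track log for one lock, shared by several recoverable threads."""

    def __init__(self):
        self.entries = []      # (thread_id, action) tuples
        self.trackers = set()  # threads that currently track this lock
        self.marked = set()    # threads whose entries await purging

    def start_tracking(self, thread_id):
        self.trackers.add(thread_id)

    def record(self, thread_id, action):
        self.entries.append((thread_id, action))

    def stop_tracking(self, thread_id):
        # Entries of the departing thread are only marked at first, because
        # other recoverable threads may still need them to detect conflicts.
        self.trackers.discard(thread_id)
        self.marked.add(thread_id)
        if not self.trackers:
            # The last tracking thread has left: purge the marked entries.
            self.entries = [e for e in self.entries if e[0] not in self.marked]
            self.marked.clear()
```

Deferring the purge keeps the common stopTracking path cheap and avoids removing entries that a concurrent recovery handler might still inspect.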

Multi-thread recovery, i.e., applying state restoration
and recovery to more than one thread, can
typically handle more failure scenarios compared to 
single-thread recovery. However, multi-thread re- 
covery is complex to implement. Moreover, multi- 
thread recovery may result in a domino effect [22] 
(also referred to as cascading aborts) potentially re- 
sulting in unavailability of resources and unbounded 
recovery time [6]. A simpler and more effective technique
would be to limit recovery to a single thread
and ensure recovery success through other mech- 
anisms such as dependency tracking and schedul- 
ing. Recovery conscious scheduling [6] describes an 
approach where dependencies between concurrent 
threads are identified and dependent threads seri- 
alized. This approach can help limit the number of 
concurrent dependent threads and increase single- 
thread recovery success. 


5 Experiments 


We have implemented the Log(Lock) architecture 
for system state restoration and micro-recovery on 
an industry standard, high-performance storage con- 
troller and applied Log(Lock) to a variety of state 
and resource locks. In this section, we present 
our evaluation of Log(Lock) with respect to perfor- 
mance, failure recovery and scalability. We next de- 
scribe our experimental setup, evaluation metrics, 
experimentation methodology and results. 

We identified state and resource instances that 
are changed or accessed rapidly through the obser- 
vation periods, based on instrumenting the system 
(Table 2). We also identified representative fail- 
ure scenarios by analyzing bug reports, failure logs 
and code. Using these scenarios as candidates for 
micro-recovery and state restoration, we evaluate 
Log(Lock) efficiency and effectiveness. In summary, 
our results show that: 


• The Log(Lock) architecture imposes negligible
overhead and sustains high performance (< 10%
impact) under a variety of workloads, even while
tracking rapidly changing state (nearly 15K
times/second) for significant durations.

• We observe an extremely high rate of recovery
success (> 99%), i.e., percentage of time restoration
levels meet recovery criterion. This high rate
of recovery success makes it evident that
micro-recovery with Log(Lock) can be a promising
approach to system recovery from transient failures.

• The Log(Lock) approach exhibits significant
improvement in availability, replacing a four second
downtime without micro-recovery with only
a 35% performance impact lasting six seconds
with Log(Lock).


5.1 Experimental Setup 


We implemented the Log(Lock)-based state restora- 
tion architecture in an enterprise-class high perfor- 
mance, highly concurrent embedded storage con- 
troller. The system consists of a 4-way processor 
complex (four 3.00 GHz Xeon 5160 processors with
12 GB memory running IBM MCP Linux) running 
the controller software over a simulated backend. 
The controller implements persistent memory (non- 
volatile storage) for write caching. Simulating the 
backend allows flexibility in terms of experimenting 
with different configurations such as read/write
latencies and error injection. The back end configuration
varied between 50-250 disks of 100 GB each with
the maximum read and write latencies of the disk set 
to 20 ms. The memory footprint of our implementation
of the Log(Lock) architecture was less than
48 KB. The host functionality was performed from a
different system (two 1.133 GHz Pentium III processors
with 1 GB memory, RHLinux 9) connected to the 
storage complex through a high-bandwidth (2 GB) 
fiber channel interconnect. 

Our workload was generated using a randomized 
synthetic workload generator which took the follow- 
ing inputs: read/write ratio, block size and queue 
depth (i.e. maximum number of outstanding re- 
quests from the host). The experiments presented 
in this paper utilized three distinct read/write ra- 
tios: 100% writes, 50%-50% mix of reads and writes 
and 100% reads. Block size was set to 4 KB and 
queue depth varied between 16 and 256. 


5.2 Metrics 


Our experiments evaluate efficiency and effectiveness 
of the Log(Lock) architecture. Efficiency and effec- 
tiveness depend on the following parameters: (1) 
rate of access to shared state or resources and (2) 
duration of a recoverable thread. Increasing each 
of these parameters results in an increase in the log 
size, logging overhead and the probability of con- 
flicts. 

Efficiency refers to the impact of Log(Lock) on 
system performance. To measure performance, we 
utilize two metrics: throughput (IOs per second or 
IOps) and latency (seconds/IO). 

Effectiveness refers to the ability of the state
restoration architecture to reduce the recovery time
and positively impact the availability of the system.
Concretely, it refers to the probability of recovery
success with the Log(Lock) architecture and the
impact on system recovery time.

Effectiveness is measured using the following met- 
rics: (1) recovery success, i.e. the percentage of time 
the restoration level meets the recovery criterion for 
single thread recovery, and (2) recovery time, i.e. the 
time required to restore the system to a consistent 
state after encountering a failure. Note that in the 
experiments reported in this paper we focus on single 
thread recovery while evaluating recovery success. 
While our Log(Lock) approach can also be applied 
to multi-thread recovery, as described in Section 4.3, 
multi-thread recovery can be costly in terms of cod- 
ing effort, resource consumption and recovery time. 
Instead, we assume that a technique such as recov- 
ery conscious scheduling [6] can help reduce the need 
for multi-thread recovery and improve the success of 
single thread recovery. 


5.3 Methodology


In order to evaluate Log(Lock), we first identify state 
and resource instances in the software for tracking. 
We instrumented the system to identify top locks in 
terms of access and contention. Table 2 shows the 
top five locks in terms of number of accesses and con- 
tention. The table shows the semantics of the lock 
(i.e., the state or resource protected), the number
of CPU cycles lost to contention, number of occur- 
rences of contention (> 2000 CPU cycles), number 
of accesses to the lock and the average number of 
lock acquisitions per IO. Frequently acquired locks 
are indicative of state that is accessed or modified
often. For example, Table 2 shows that the fiber channel
lock, which is accessed nearly 10 times per IO, is a good
candidate for evaluating the efficiency of Log(Lock).
Contention, while indicative of longer durations of 
holding locks, also shows a higher probability of ac- 
cesses by concurrent threads. As Table 2 shows, the 
percentage of accesses resulting in lock contention is 
low as a result of the highly concurrent design of the 
controller. Thus, for short durations of tracking we 
expect high recovery success. 

To evaluate effectiveness, we first measure the 
recovery success for the candidates identified from 
Table 2. We measure recovery success across locks 
with different rates of access and varying duration of 
tracking. To evaluate the impact on recovery time, 
we identify candidates for state restoration based on 
analysis of the software, failure logs and defects. 

We present an evaluation of the efficiency of our
Log(Lock) architecture as compared to the original
system, henceforth referred to as baseline. The
baseline implementation does not perform state
restoration or fine-grained recovery. Instead, it uses a
highly efficient system level recovery mechanism that
checks all persistent system structures such as
non-volatile data in the write cache for consistency,
reinitializes software state and redrives lost tasks. Note
that no hardware reboot is involved.

Table 2: Lock Access over 75 minutes

Lock           | Contention Cycles (Count) | Number of accesses | Locks/IO
Fiber channel  | 2654991 (578)             | 137196747          | 10.34
IO state       | 219969 (76)               | 90122610           | 6.79
Resource       | 608103 (100)              | 63482290           | 4.78
Resource state | 124965 (52)               | 30040757           | 2.26
Throttle timer | 79848 (11)                | 113316             | 0.0085

Figure 6: Rate vs Throughput (100% Writes)

An alternative approach to Log(Lock) is to imple- 
ment schemes such as strict 2-phase locking (2PL), 
commonly used in transactional systems. Essen- 
tially, these protocols require locks to be held for the 
entire duration of a recoverable thread. However, 
due to the high degree of concurrency in the sys- 
tem and the implementation of lock timeouts, such a 
scheme when implemented in our storage controller 
software caused lock timeouts and failed to bring 
up the system. Therefore, throughout this evalua- 
tion section, we primarily use the baseline system 
for comparison. 


5.4 Efficiency of Log(Lock) 


In order to measure efficiency, we compare the per- 
formance of the Log(Lock) architecture with the 
baseline system during failure-free operation. 


5.4.1 Effect of Frequency of State Change 


As described in Section 5.2, as the rate of accesses to 
a state variable or resource being tracked increases, 
the logging overhead increases. The workloads used 
for this experiment consisted of 100% write IOs and 
the data is averaged over 10 runs of 10 minutes each. 
The queue depth is represented on the x-axis. For 
this experiment, we chose four locks from Table 2, 
representative of a range of access rates, ranging 
from 12.5 times/second to 15244 times/second. The
duration of tracking was 2600 CPU cycles on average
(and standard deviation 265 CPU cycles).

Table 3: % Duration of Tracking vs Latency
(% increase in latency over baseline)

Queue | Duration of tracking in CPU cycles
Depth | 2894  | 7258  | 20228 | 34642 | 69830
16    | 2.03% | 0.68% | 0.00% | 4.05% | 9.46%
32    | 1.69% | 0.34% | 0.34% | 4.39% | 10.47%
64    | 2.72% | 0.34% | 0.51% | 4.76% | 10.71%
128   | 2.54% | 0.85% | 0.00% | 5.08% | 9.32%
256   | 2.10% | 0.00% | 0.42% | 2.94% | 8.82%

Table 4: % Overhead (other workloads)

Queue | Workload 1           | Workload 2
Depth | Throughput | Latency | Throughput | Latency
16    | 0.43%      | 0.47%   | 0.08%      | ~0.00%
32    | 0.25%      | ~0.00%  | 0.78%      | 0.75%
64    | 0.24%      | 0.39%   | 0.13%      | ~0.00%
128   | 0.29%      | 0.39%   | 0.79%      | 0.75%
256   | 0.25%      | 0.00%   | 0.12%      | 0.19%


Figure 6 shows the throughput with varying ac- 
cess rates under different queue depths. The num- 
bers show that even for high access rates, the 
Log(Lock) approach has negligible impact on perfor- 
mance. The lock with access rate 14107 times/sec 
(the resource pool lock) was tracked for 2429 CPU 
cycles and results in a 4.5% drop in throughput. 
We attribute this to the occurrence of nested lock 
conditions in that particular code path, causing the 
system to be sensitive to even the small delay intro- 
duced by Log(Lock). 


Figure 7 shows the variation of latency with queue 
depth for different access rates. The curves for 
the various access rates almost completely overlap 
showing that across configurations, the impact of 
Log(Lock) on latency, even for high access rates, 
is negligible. The observation that the latency in- 
creases with queue depth is a queuing effect com- 
monly observed in systems [23] and is independent 
of Log(Lock). Figure 8 zooms into the points for 
queue depth 16 to give the reader a closer look at 
the data. As in the case of throughput, latency in- 
creases by ~4% for the resource pool lock and is at- 
tributed to the occurrence of nested lock situations 
in the code path. The important message from Fig- 
ures 6 and 7 is that Log(Lock) tracking can sustain 
high performance even while tracking rapidly
modified/accessed state or resources.


5.4.2 Effect of Duration of Tracking


Figure 9 and Table 3 show the variation of system
performance with different durations of tracking.
The durations were measured in terms of the number
of CPU cycles between the startTracking and
stopTracking calls, averaged over 10 runs of 10
minutes each. The independent parameter queue depth
is shown on the x-axis. The data represents the per- 
formance for candidate locks from Table 2 that were 
tracked for different durations ranging from 2894 
CPU cycles to 69830 CPU cycles (IO state for 2894 
and 69830 CPU cycles, timer, fiber channel and re- 
source pool for 7258, 20228 and 34642 CPU cycles 
respectively). The numbers were chosen to be repre- 
sentative of a range of tracking durations. Since no 
functional code was modified, rather than varying 
the duration of a single lock, different locks were in- 
strumented to obtain this range. The rate of access 
of each lock varied as shown in Table 5. 

From Figure 9 and Table 3 we observe that the
performance of the system with Log(Lock) is com- 
parable to the baseline system across various queue 
depths. For the IO state lock (a lock in the IO path), 
when the duration of tracking was increased from 
2894 CPU cycles to 69830 CPU cycles, the through- 
put dropped by 8.85% and response time increased 
by 9.76% on average compared to baseline. This 
drop in performance can be attributed to two fac- 
tors: (1) occurrence of more conflicts with increase 
in duration of tracking and (2) increased possibility
of encountering nested lock conditions, which are
sensitive to the delay introduced by tracking. In
the case of the resource lock, a tracking duration of
34642 CPU cycles resulted in a drop of only 4.3%,
which is nearly identical to the performance with a 
tracking duration of only 2429 CPU cycles, as shown 
in Section 5.4.1. We conclude that, though the over- 
head of tracking is a function of both the frequency 
and duration of tracking, it is more significantly im- 
pacted by the semantics of the lock being tracked 
and the efficiency of the code path involving the lock. 


5.4.3 Performance with Other Workloads


Table 4 shows the throughput and latency with four
other workloads. The figures compare the performance
of a system powered by Log(Lock) and the
baseline system under varying queue depths for the
following workloads: Workload-1 (100% read, disk
latency 1 ms), and Workload-2 (50% read, disk
latency 1 ms). Data from tracking the fiber channel
lock (15244 times/sec for 20228 CPU cycles each) is 
shown. Overall, the impact on performance was < 
1% in all cases. These results reiterate the observa- 
tion that Log(Lock) is lightweight and sustains high 
performance for a range of workloads. 

Examining the object code for our implemen- 
tation showed that in the event of a lock being 
tracked, fewer than 200 assembly instructions were 
added to the code path. Assuming one instruc- 
tion executes per CPU cycle, even at a frequency of 
15244 times/second, on a 3.00 GHz processor, this 
amounts to a time overhead of less than 1% (assum- 
ing that the size of the state being saved to undo logs 
is small). Also, note that storage controller code 
by itself is aggressively optimized to sustain high 
throughput, minimize the duration of locks in the 
I/O path and avoid nesting of locks to a large extent. 
Unlike checkpoints, which require a large amount of 
state to be copied to stable storage, our techniques 
copy small amounts of relevant state and informa- 
tion in memory only. The combination of all these 
factors results in the Log(Lock) system being able to 
sustain high performance despite an extremely high 
frequency of access to shared state and resources. 
In conclusion, we believe that the scenarios where 
performance will be impacted by tracking are when 
there are multiple levels of nesting with frequently 
accessed locks, increasing sensitivity to tracking de- 
lay. However, we expect that these situations are 
uncommon in well-designed systems. 
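
The instruction-count estimate above can be checked with a short calculation. The sketch below uses only the figures quoted in the text (200 added instructions, 15244 tracked accesses per second, a 3.00 GHz clock) and keeps the stated simplification of one instruction per cycle; the function name is illustrative, and attributing all of the work to a single processor is an additional assumption.

```python
def tracking_overhead_fraction(instructions_per_access, accesses_per_second,
                               clock_hz):
    """Fraction of one processor's cycles spent on tracking.

    Assumes one instruction retires per CPU cycle, as in the text.
    """
    extra_cycles_per_second = instructions_per_access * accesses_per_second
    return extra_cycles_per_second / clock_hz


overhead = tracking_overhead_fraction(200, 15244, 3.00e9)
print(f"{overhead:.4%}")  # roughly 0.1% of one processor
```

The result is about 0.1% of one processor's cycles, roughly an order of magnitude inside the 1% bound stated above. That leaves some headroom for the cost of copying state into the undo logs, which the estimate does not include.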


5.5 Effectiveness of Log(Lock) 


The next set of experiments focuses on evaluating
the effectiveness of a micro-recovery framework
with Log(Lock) in improving system recovery.


5.5.1 Recovery Success


The first metric of effectiveness is recovery success 
i.e., the percentage of time the restoration level 
meets the recovery criterion at the end of execution 
of a recoverable thread. This metric demonstrates 
the opportunity for micro-recovery in the system 
and evaluates if the system can effectively utilize 
Log(Lock)-based state restoration. Table 5 shows 
the recovery success for locks of varying semantics, 
rates of access and duration of tracking. The IO 
state lock was tracked for two types of recoverable 
threads, for a duration of 2894 CPU cycles in one 
and 69830 CPU cycles in the other. Hence data for 
this lock appears twice in Table 5. For each lock, the 
recovery criterion, the number of tracking threads 
per second, the rate of access, duration of track- 
ing and recovery success are shown. The restoration 
level in each case was obtained by calling the ge- 
tRestorationLevel method before stopTracking, and 
recovery success was computed as the percentage of 
time the restoration level met the recovery criterion. 
As Table 5 shows, our storage controller exhibits a 
high rate of recovery success for a range of locks, 
even with high rates of access. We conclude that, for 
failures involving the restoration of these instances 
of state and resources, fine-grained recovery presents 
an effective recovery strategy. 


5.5.2 Recovery Time 


To illustrate the impact of Log(Lock)-based micro- 
recovery on the overall recovery time and availability 
of the controller software, we injected transient fail- 
ures that disappeared on retry. The failures required 
restoration of the IO state to its previous value and 
a retry of the function. For the Log(Lock) system, 
the recovery criterion for IO state was set as shown 
in Table 5. Once the failure was injected, the thread 
verified if the restoration level at the time of recovery 
met the recovery criterion, before attempting state 
restoration and retry. The tracking duration was 
equivalent to the setup with 69830 CPU cycles.
Figures 10 and 11 show the variation of through- 
put and latency respectively over time. The points 
of failure injection are marked in the figures. The 
throughput and latency shown are for a workload 
with 100% write IOs, queue depth 64 and disk la- 
tency 20 ms. The Log(Lock) architecture is com- 
pared to system-level recovery (abbreviated as SLR) 
in the case of the baseline system. Recall that 
SLR is implemented entirely in software and in- 
volves restarting the controller process and verify- 
ing data structures and cache data for consistency 
before redriving IO transactions. Overall, during
failure-free operation, the average throughput and
latency are 708 IOps and 0.0946 sec/IO with Log(Lock),
and 710 IOps and 0.0912 sec/IO for the baseline system.

Table 5: Recovery Success with the 100% Write Workload

Lock           | Recovery Criterion    | Tracking Calls | #Access     | Duration     | Recovery
               |                       | (times/sec)    | (times/sec) | (CPU cycles) | Success
Fiber channel  | No Residual Resources | 3666           | 15244       | 20228        | 100%
IO state       | No DR, LU or UR       | 2500           | 10266       | 2894         | 99.88%
Resource pool  | No Residual Resources | 10             | 14107       | 34642        | 100%
Resource state | No Residual Resources | 5              | 6675        | 4806         | 100%
Throttle timer | No Residual Resources | 10             | 12.59       | 7258         | 100%
IO state       | No DR, LU or UR       | 2444           | 10045       | 69830        | 99.38%

Figure 10: Throughput with Error Injection

Log(Lock)-enabled micro-recovery imposes a 35% 
performance overhead lasting six seconds during re- 
covery. However, system-level recovery results in 
4 seconds downtime and it takes an additional 2 
seconds to begin sustaining high performance. It 
is important to remember that as the size of the 
system and in-memory data structures increase, the 
recovery time for SLR is bound to increase. This, 
along with the opportunity for micro-recovery illus- 
trated by the high recovery success shown in the 
previous experiment, further promotes the case for
micro-recovery in high performance systems like the 
storage controller. 
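
One rough way to compare the two recovery modes is to express each as the equivalent number of seconds of full-rate service lost. The sketch below is a back-of-the-envelope reading of the figures quoted above, not a result from the experiments: the helper name is hypothetical, and the two-second ramp-up after system-level recovery is ignored because its degree of degradation is not stated, so the SLR figure is a lower bound.

```python
def full_rate_seconds_lost(degradation, duration_s):
    """Equivalent seconds of full-rate service lost.

    degradation: fraction of performance lost (1.0 means complete downtime).
    duration_s:  how long the degradation lasts, in seconds.
    """
    return degradation * duration_s


# Micro-recovery: 35% performance impact lasting six seconds.
micro_recovery_loss = full_rate_seconds_lost(0.35, 6)

# System-level recovery: four seconds of downtime (ramp-up not counted).
slr_loss_lower_bound = full_rate_seconds_lost(1.0, 4)

print(micro_recovery_loss, slr_loss_lower_bound)
```

Under these assumptions micro-recovery costs about 2.1 equivalent seconds against at least 4 for system-level recovery, and the system remains available to hosts throughout.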


6 Related Work 


Our work is largely inspired by previous work in 
the area of transactional systems, software fault tol- 
erance and system availability. Hardware redun- 
dancy and software redundancy [24], rejuvenation [9] 
or fault isolation approaches such as isolating VMs 
from the failure of other VMs [14] are complemen- 
tary to our techniques and are already deployed in 
our setups. Since these approaches are targeted at 
handling failures at a different level they focus on a 
coarser granularity of recovery compared to our tech- 
niques. Failure-oblivious computing [25] introduces 
a novel method to handle failures - by ignoring them 
and returning possibly arbitrary values. This technique
may be applicable to systems like search engines
where a few missing results may go unnoticed,
but is not an option in storage controllers where
ignoring failures or returning arbitrary values could
lead to data corruption.

Figure 11: Latency with Error Injection

Application-specific recovery mechanisms such as 
recovery blocks [22], and exception handling [26] are 
used in many software systems. Constructs such as 
try/throw/catch [27] can be used to transfer con- 
trol to an exception handler and a similar excep- 
tion model is used by our implementation. However 
such exception handling constructs alone are insuf- 
ficient for performing micro-recovery which requires 
richer failure context information. The goal of the 
Log(Lock) architecture is to provide this context in- 
formation and provide the developer with a set of 
guidelines to decide the precise way in which the 
system should be restored given the failure context. 

Logging of access patterns has been used for de- 
terministic replay [15, 16, 17] during debugging. 
However, in micro-recovery, there is no requirement 
to perform deterministic replay. Also, the purpose 
of logging access patterns in Log(Lock) is to identify 
recovery dependencies between concurrent threads. 


7 Conclusion 


We have presented Log(Lock), a practical and flex- 
ible architecture for tracking dynamic dependencies 
and performing state restoration without rearchi- 
tecting legacy code. By exploring system state 
space, we formally model thread dependencies based 
on both state and shared resources, capturing failure 
contexts through different ‘restoration levels’. We 
develop recovery strategies in the form of restoration
protocols based on recovery points and restoration
levels. A comprehensive experimental evaluation
shows that Log(Lock)-enabled micro-recovery
is both efficient and effective in reducing system
recovery time.

Even with retrofittable mechanisms such as 
micro-recovery, we emphasize that failure recovery 
should be a design concern. One approach to reduc- 
ing recovery time would be to design the software us- 
ing components with independent failure modes (e.g. 
client-server interactions) or use a state space based 
approach where transitions to functional states can 
be identified even from a failure state. 

Our effort in designing scalable failure recovery 
continues along a number of directions. One of 
our ongoing efforts is to reduce the need for pro- 
grammer intervention in defining recovery actions. 
We are also interested in deploying and evaluating 
Log(Lock) in other high performance systems.


Abstract


Traditional caching policies are known to perform poorly 
for storage server caches. One promising approach to 
solving this problem is to use hints from the storage 
clients to manage the storage server cache. Previous 
hinting approaches are ad hoc, in that a predefined re- 
action to specific types of hints is hard-coded into the 
caching policy. With ad hoc approaches, it is difficult to 
ensure that the best hints are being used, and it is diffi- 
cult to accommodate multiple types of hints and multiple 
client applications. In this paper, we propose CLient- 
Informed Caching (CLIC), a generic hint-based policy 
for managing storage server caches. CLIC automatically 
interprets hints generated by storage clients and trans- 
lates them into a server caching policy. It does this with- 
out explicit knowledge of the application-specific hint 
semantics. We demonstrate using trace-based simula- 
tion of database workloads that CLIC outperforms hint- 
oblivious and state-of-the-art hint-aware caching poli- 
cies. We also demonstrate that the space required to track 
and interpret hints is small. 


1 Introduction 


Multi-tier block caches arise in many situations. For ex- 
ample, running a database management system (DBMS) 
on top of a storage server results in at least two caches, 
one in the DBMS and one in the storage system. The 
challenges of making effective use of caches below the 
first tier are well known [15, 19, 22]. Poor temporal
locality in the request streams experienced by the
second-tier caches reduces the effectiveness of recency-based
replacement policies [22], and failure to maintain
exclusivity among the contents of the caches in each tier leads to
wasted cache space [19]. 

Many techniques have been proposed for improving 
the performance of second-tier caches. Section 7 pro- 
vides a brief survey. One promising class of techniques 
relies on hinting: the application that manages the first-tier
cache generates hints and attaches them to the I/O
requests that it directs to the second tier. The cache at the 
second tier then attempts to exploit these hints to improve 
its performance. For example, an importance hint [6] in- 
dicates the priority of a particular page to the buffer cache 
manager in the first-tier application. Given such hints, 
the second-tier cache can infer that pages that have high 
priority in the first tier are likely to be retained there, and 
can thus give them low priority in the second tier. As 
another example, a write hint [11] indicates whether the 
first tier is writing a page to ensure recoverability of the 
page, or to facilitate replacement of the page in the first- 
tier cache. The second tier may infer that replacement 
writes are better caching candidates than recovery writes, 
since they indicate pages that are eviction candidates in 
the first tier. 


Hinting is valuable because it is a way of making 
application-specific information available to the second 
(or lower) tier, which needs a good basis on which to 
make its caching decisions. However, previous work has 
taken an ad hoc approach to hinting. The general ap- 
proach is to identify a specific type of hint that can be 
generated from an application (e.g., a DBMS) in the first 
tier. A replacement policy that knows how to take advan- 
tage of this particular type of hint is then designed for the 
second tier cache. For example, the TQ algorithm [11] is 
designed specifically to exploit write hints. The desired 
response to each possible hint is hard-coded into such an 
algorithm. 

Ad hoc algorithms can significantly improve the per- 
formance of the second-tier cache when the necessary 
type of hint is available. However, ad hoc algorithms also
have some significant drawbacks. First, because the re- 
sponse to hints is hard-coded into an algorithm at the sec- 
ond tier, any change to the hints requires changes to the 
cache management policy at the second-tier server. Sec- 
ond, even if change is possible at the server, it is difficult 
to generalize ad hoc algorithms to account for new situations.
For example, suppose that applications can generate
both write hints and importance hints. Clearly, a
low-priority (to the first tier) replacement write is prob- 
ably a good caching candidate for the second tier, but 
what about a low-priority recovery write? In this case, 
the importance hint suggests that the page is a good can- 
didate for caching in the second tier, but the write hint 
suggests that it is a poor candidate. One response to this 
might be to hard code into the second-tier cache manager 
an appropriate behavior for all combinations of hints that 
might occur. However, each new type of hint will mul- 
tiply the number of possible hint combinations, and it 
may be difficult for the policy designer to determine an 
appropriate response for each one. A related problem 
arises when multiple first-tier applications are served by 
a single cache in the second tier. If different applications 
generate hints, how is the second tier cache to compare 
them? Is a write hint from one application more or less 
significant than an importance hint from another? 

In this paper, we propose CLient-Informed Caching 
(CLIC), a generic technique for exploiting application 
hints to manage a second-tier cache, such as a storage 
server cache. Unlike ad hoc techniques, CLIC does not 
hard-code responses to any particular type of hint. In- 
stead, it is an adaptive approach that attempts to learn to 
exploit any type of hint that is supplied to it. Applica- 
tions in the first tier are free to supply any hints that they 
believe may be of value to the second tier. CLIC analyzes 
the available hints and determines which can be exploited 
to improve second-tier cache performance. Conversely, 
it learns to ignore hints that do not help. Unlike ad hoc 
approaches, CLIC decouples the task of generating hints 
(done by applications in the first tier) from the task of 
interpreting and exploiting them. CLIC naturally accom- 
modates applications that generate more than one type of 
hint, as well as scenarios in which multiple applications 
share a second-tier cache. 

The contributions of this paper are as follows. First, 
we define an on-line cost/benefit analysis of I/O request 
hints that can be used to determine which hints provide 
potentially valuable information to the second-tier cache. 
Second, we define an adaptive, priority-based cache re- 
placement policy for the second-tier cache. This policy 
exploits the results of the hint analysis to improve the hit 
ratio of the second-tier cache. Third, we use trace-based 
simulation to provide a performance analysis of CLIC. 
Our results show that CLIC outperforms ad hoc hinting 
techniques and that its adaptivity can be achieved with 
low overhead. 


2 Generic Framework for Hints 


We assume a system in which multiple storage server 
client applications generate requests to a storage server, 
as shown in Figure 1. We are particularly interested in 
Figure 1: System Architecture


client applications that cache data, since it is such appli- 
cations that give rise to multi-tier caching. 

The storage server’s workload is a sequence of block 
I/O requests from the various clients. When a client 
sends an I/O request (read or write) to the server, it may 
attach hints to the request. Specifically, each storage 
client may define one or more hint types and, for each 
such hint type, a hint value domain. When the client is- 
sues an I/O request, it attaches a hint set to the request. 
Each hint set consists of one hint value from the domain 
of each of the hint types defined by that client. For example,
we used IBM DB2 Universal Database as a storage
client application, and we instrumented DB2 so that it 
would generate five types of hints, as described in the 
first five rows of Figure 2. Thus, each I/O request is- 
sued by DB2 will have an attached hint set consisting of 
5 hint values: a pool ID, an object ID, an object type ID, 
a request type, and a DB2 buffer priority. 

CLIC does not require these specific hint types. We 
chose these particular types of hints because they could 
be generated easily from DB2, and because we believed 
that they might prove useful to the underlying storage 
system. Each application can generate its own types of 
hints. CLIC itself only assumes that the hint value do- 
mains are categorical. It neither assumes nor exploits 
any ordering on the values in a hint value domain. Each 
storage client application may have its own hint types. In 
fact, even if two storage clients are instances of the same 
application (e.g., two instances of DB2) and use the same 
hint types, CLIC treats each client’s hint types as distinct 
from the hint types of all other clients. 
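
The hint-set model described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the names HintSet and make_hint_set, the client identifiers, and the example hint values are hypothetical and do not come from the DB2 or MySQL instrumentation. The sketch shows the two properties CLIC relies on: hint values are compared only for equality (categorical domains), and hint sets from different clients never compare equal.

```python
from typing import NamedTuple, Tuple


class HintSet(NamedTuple):
    """A hint set treated as an opaque, hashable key (hypothetical representation)."""
    client_id: str           # separates clients, even two instances of the same DBMS
    values: Tuple[str, ...]  # one categorical value per hint type defined by the client


def make_hint_set(client_id, *values):
    # Values are stringified so that only equality is ever used, never ordering.
    return HintSet(client_id, tuple(str(v) for v in values))


# Two clients sending identical hint values still produce distinct hint sets.
a = make_hint_set("db2-instance-1", 2, 17, "index", "replacement_write", 3)
b = make_hint_set("db2-instance-2", 2, 17, "index", "replacement_write", 3)

# Hint sets can be used directly as keys of a hash table of statistics.
stats = {a: 0, b: 0}
```

Because the server never inspects the individual values, a client can add or remove hint types without any change to this representation.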


3 Hint Analysis 


Every I/O request, read or write, represents a caching op- 
portunity for the storage server. The storage server must 
decide whether to take advantage of each such opportu- 
nity by caching the requested page. Our approach is to 
base these caching decisions on the hint sets supplied by 
the client applications with each I/O request. CLIC as- 
sociates each possible hint set H with a numeric priority, 
DBMS  | Hint Type       | Value Domain Cardinality (TPC-C) | Value Domain Cardinality (TPC-H) | Description
DB2   | pool ID         | 2  | 5  | Identifies which DB2 buffer pool generated the I/O request.
DB2   | object ID       | 21 | 23 | Identifies a group of related database objects, such as a table and its associated indices.
DB2   | object type ID  | 6  | 9  | Identifies object type, such as table or index. Together, a pool ID, object ID and object type ID uniquely identify a database object.
DB2   | request type    | 5  | 5  | For read requests, distinguishes regular reads from prefetch reads. For writes, provides write hints [11], which distinguish between recovery writes, replacement writes, and synchronous writes. Synchronous writes are replacement writes that are not performed by an asynchronous page cleaning thread.
DB2   | buffer priority | 4  | 1  | Identifies the priority of the page in its DB2 buffer cache.
MySQL | thread ID       | -  | 5  | ID of the server thread that issued the request.
MySQL | request type    | -  | 3  | Read, replacement write, or recovery write.
MySQL | file ID         | -  | 9  | MySQL is configured so that each table is stored in a separate file, together with any indexes defined on that table, so this hint distinguishes groups of database objects.
MySQL | fix count       | -  | 2  | Indicates how many MySQL threads have currently fixed (pinned) this page in the buffer pool.





Figure 2: Types of Hints in the DB2 and MySQL I/O Request Traces 


Pr(H). When an I/O request (read or write) for page p 
with attached hint set H arrives at the server, the server 
uses Pr(H) to decide whether to cache p. Cache man- 
agement at the server will be described in more detail 
in Section 4, but the essential idea is simple: the server 
caches p if there is some page p’ in the cache that was re- 
quested with a hint set H’ for which Pr(H’) < Pr(H). 


We expect that some hint sets may signal pages that are 
likely to be re-used quickly, and thus are good caching 
candidates. Other hint sets may signal the opposite. In- 
tuitively, we want the priority of each hint set to reflect 
these signals. But how should priorities be chosen for 
each hint set? One possibility is to assign these priorities, 
in advance, based on knowledge of the client applica- 
tion that generates the hint sets. Most existing hint-based 
caching techniques use this approach. For example, the 
TQ algorithm [11], which exploits write hints, under- 
stands that replacement writes likely indicate evictions in 
the client application’s cache, and so it gives them high 
priority. 

CLIC takes a different approach to this problem. In- 
stead of predefining hint priorities based on knowledge 
of the storage client applications, CLIC assigns a prior- 
ity to each hint set by monitoring and analyzing I/O re- 
quests that arrive with that hint set. Next, we describe 


how CLIC performs its analysis. 

We will assume that each request that arrives at the 
server is tagged (by the server) with a sequence number. 
Suppose that the server gets a request (p, H), meaning 
a request (read or a write) for a page p with an attached 
hint set H, and suppose that this request is assigned sequence
number s1. CLIC is interested in whether and
when page p will be requested again after s1. There are
three possibilities to consider: 


write re-reference: The first possibility is that the next
request for p in the request stream is a write request
occurring with sequence number s2 (s2 > s1). In
this case, there would have been no benefit whatsoever
to caching p at time s1. A cached copy of
p would not help the server handle the subsequent
write request any more efficiently. A cached copy
of p may be of benefit for requests for p that occur
after s2, but in that case the server would be better
off caching p at s2 rather than at s1. Thus, the
server’s caching opportunity at s1 is best ignored.


read re-reference: The second possibility is that the
next request for p in the request stream is a read request
at time s2. If the server caches p at time s1
and keeps p in the cache until s2, it will benefit by
being able to serve the read request at s2 from its
cache. For the server to obtain this benefit, it must
allow p to occupy one page “slot” in its cache during
the interval s2 - s1.


no re-reference: The third possibility is that p is never
requested again after s1. In this case, there is clearly
no benefit to caching p at s1.


Of course, the server cannot determine which of these 
three possibilities will occur for any particular request, 
as that would require advance knowledge of the future 
request stream. Instead, we propose that the server base 
its caching decision for the request (p, H) on an analysis 
of previous requests with hint set H. Specifically, CLIC 
tracks three statistics for each hint set H: 


N(H): the total number of requests with hint set H.


N_r(H): the total number of requests with hint set H that
result in a read re-reference (rather than a write re-reference
or no re-reference).


D(H): for those requests (p, H) that result in read re- 
references, the average number of requests that oc- 
cur between the request and the read re-reference. 


Using these three statistics, CLIC performs a simple 
benefit/cost analysis for each hint set H, and assigns 
higher priorities to hint sets with higher benefit/cost ra- 
tios. Suppose that the server receives a request (p, H) 
and that it elects to cache p. If a read re-reference sub- 
sequently occurs while p is cached, the server will have 
obtained a benefit from caching p. We arbitrarily assign 
a value of 1 to this benefit (the value we use does not af- 
fect the relative priorities of pages). Among all previous 
requests with hint set H, a fraction

$$f_{hit}(H) = \frac{N_r(H)}{N(H)} \qquad (1)$$

eventually resulted in read re-references, and would have
provided a benefit if cached. We call f_hit(H) the read hit
rate of hint set H. Since the value of a read re-reference
is 1, f_hit(H) can be interpreted as the expected benefit
of caching and holding pages that are requested with hint
set H. Conversely, D(H) can be interpreted as the expected
cost of caching such pages, as it measures how
long such pages must occupy space in the cache before
the benefit is obtained. We define the caching priority of
hint set H as:

$$Pr(H) = \frac{f_{hit}(H)}{D(H)} \qquad (2)$$

which is the ratio of the expected benefit to the expected
cost.
Figure 3 illustrates the results of this analysis for a 
trace of I/O requests made by DB2 during a run of the 
Figure 3: Hint Set Priorities for the DB2_C60 Trace


Each point represents a distinct hint set. All hint sets are shown. 


TPC-C benchmark. Our workload traces will be de- 
scribed in more detail in Section 6. Each point in Fig- 
ure 3 represents a distinct hint set that is present in the 
trace, and describes the hint set’s caching priority and 
frequency of occurrence. All hint sets with non-zero 
caching priority are shown. Clearly, some hint sets have 
much higher priorities, and thus much higher benefit/cost 
ratios, than others. For illustrative purposes, we have in- 
dicated partial interpretations of two of the hint sets in the 
figure. For example, the most frequently occurring hint 
set represents replacement writes to the STOCK table in 
the TPC-C database instance that was being managed by 
the DB2 client. We emphasize that CLIC does not need 
to understand that this hint represents the STOCK table, 
nor does it need to understand the difference between a 
replacement write and a recovery write. Its interpreta- 
tion of hints is based entirely on the hint statistics that it 
tracks, and it can automatically determine that a request 
with the STOCK table hint set is a better caching oppor- 
tunity than a request with the ORDERLINE table hint 
set. 
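
As a concrete illustration of Equations 1 and 2, the following sketch computes the read hit rate and the caching priority from the three per-hint-set statistics. The function names and the numbers in the example are our own inventions for illustration, not values from the traces. We also assume, for the sketch only, that a hint set with no observed read re-references receives priority zero.

```python
def read_hit_rate(n_total, n_read_reref):
    """Equation 1: fraction of requests that led to a read re-reference."""
    return n_read_reref / n_total if n_total else 0.0


def caching_priority(n_total, n_read_reref, avg_distance):
    """Equation 2: expected benefit divided by expected cost."""
    if n_total == 0 or n_read_reref == 0 or avg_distance <= 0:
        return 0.0  # assumption: no observed benefit means lowest priority
    return read_hit_rate(n_total, n_read_reref) / avg_distance


# Hypothetical statistics for two hint sets, each seen 1000 times.
# The first is re-read often and soon; the second rarely and late.
quick = caching_priority(n_total=1000, n_read_reref=500, avg_distance=2000.0)
slow = caching_priority(n_total=1000, n_read_reref=100, avg_distance=50000.0)
```

In this made-up example the first hint set has the higher benefit/cost ratio, so pages requested with it would be preferred by the cache.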


3.1 Tracking Hint Set Statistics 


To track hint set statistics, CLIC maintains a hint table
with one entry for each distinct hint set H that has been
observed by the storage server. The hint table entry for H
records the current values of the statistics N(H), N_r(H)
and D(H). When the server receives a request (p, H), it
increments N(H). Tracking N_r(H) and D(H) is somewhat
more involved, as CLIC must determine whether a
read request for page p is a read re-reference. To determine
this, CLIC records two pieces of information
for every page p that is cached: seq(p), which is the
sequence number of the most recent request for p, and
H(p), which is the hint set that was attached to the most
recent request for p. In addition, CLIC records seq(p)
and H(p) for a fixed number (N_outq) of additional, uncached
pages. This additional information is recorded in
a data structure called the outqueue. N_outq is a CLIC
parameter that can be used to bound the amount of space
required for tracking read re-references. When the server
receives a read request for page p with sequence number
s, it checks both the cache and the outqueue for information
about the most recent previous request, if any, for p.
If it finds seq(p) and H(p) from a previous request, then
it knows that the current request is a read re-reference of
p. It increments N_r(H(p)) and it updates D(H(p)) using
the re-reference distance s - seq(p).

When a page p is evicted from the cache, an entry for 
p is inserted into the outqueue. An entry is also placed 
in the outqueue for any requested page that CLIC elects 
not to cache. (CLIC’s caching policy is described in Sec- 
tion 4.) If the outqueue is full when a new entry is to be 
inserted, the least-recently inserted entry is evicted from 
the outqueue to make room for the new entry. 

Since CLIC only records seq(p) and H(p) for a lim- 
ited number of pages, it may fail to recognize that a new 
read request (p, H) is actually a read re-reference for p. 
Some error is inevitable unless CLIC were to record information
about all requested pages. However, CLIC’s
approach to tracking page re-references has several advantages.
First, since CLIC tracks the most recent reference
to all pages that are in the cache, we expect to have
accurate re-reference distance estimates for hint sets that 
are believed to have the highest priorities, since pages 
requested with those hint sets will be cached. If the pri- 
ority of such hint sets drops, CLIC should be able to 
detect this. Second, by evicting the oldest entries from 
the outqueue when eviction is necessary, CLIC will tend 
to miss read re-references that have long re-reference 
distances. Conversely, read re-references that happen 
quickly are likely to be detected. These are exactly the 
type of re-references that lead to high caching priority. 
Thus, CLIC’s statistics tracking is biased in favor of read 
re-references that are likely to lead to high caching pri- 
ority. 
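
The bookkeeping described in this section can be sketched as follows. This is a simplified illustration, not the implementation used in our simulator: the class and method names are hypothetical, D(H) is kept as a running sum divided by N_r(H), and the caller is assumed to decide separately which pages are cached. The sketch assumes n_outq is at least 1.

```python
from collections import OrderedDict


class HintStats:
    def __init__(self):
        self.n = 0         # N(H): requests seen with this hint set
        self.n_r = 0       # N_r(H): requests that led to a read re-reference
        self.dist_sum = 0  # sum of re-reference distances; D(H) = dist_sum / n_r


class Tracker:
    def __init__(self, n_outq):
        self.n_outq = n_outq
        self.cached = {}               # page -> (seq, hint set) for cached pages
        self.outqueue = OrderedDict()  # page -> (seq, hint set), oldest first
        self.table = {}                # hint table: hint set -> HintStats
        self.seq = 0

    def observe(self, page, hint_set, is_read):
        """Update the hint table for one request and return its sequence number."""
        self.seq += 1
        self.table.setdefault(hint_set, HintStats()).n += 1
        prev = self.cached.get(page) or self.outqueue.get(page)
        if is_read and prev is not None:
            prev_seq, prev_hint = prev
            st = self.table.setdefault(prev_hint, HintStats())
            st.n_r += 1                       # credit goes to the earlier hint set
            st.dist_sum += self.seq - prev_seq
        return self.seq

    def remember_uncached(self, page, seq, hint_set):
        """Record seq(p) and H(p) for a page that is not (or no longer) cached."""
        self.outqueue.pop(page, None)          # re-inserting makes the entry newest
        if len(self.outqueue) >= self.n_outq:
            self.outqueue.popitem(last=False)  # drop the least-recently inserted entry
        self.outqueue[page] = (seq, hint_set)


# Tiny example in which nothing is cached, so every request goes to the outqueue.
t = Tracker(n_outq=2)
for page, hint, is_read in [("p1", "A", False), ("p2", "B", True), ("p1", "B", True)]:
    s = t.observe(page, hint, is_read)
    t.remember_uncached(page, s, hint)
```

In the example, the third request is a read of p1, which was last requested with hint set A two requests earlier, so A is credited with one read re-reference at distance 2.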


3.2 Time-Varying Workloads


To accommodate time-varying workloads, CLIC divides 
the request stream into non-overlapping windows, with 
each window consisting of W requests. At the end of 
each window, CLIC adjusts the priority for each hint set 
using the statistics collected during that window. The 
adjusted priority will be used to guide the caching policy
during the next window. CLIC then clears the statistics
(N(H), N_r(H), D(H)) for all hint sets in the hint table
so that it can collect new statistics during the next window.


Let $Pr(H)_i$ represent the priority of H that is calculated
after the ith window, and that is used by CLIC’s
caching policy during window i + 1. Priority $Pr(H)_i$ is
calculated as follows:


$$Pr(H)_i = r\,\hat{Pr}(H)_i + (1 - r)\,Pr(H)_{i-1} \qquad (3)$$


where $\hat{Pr}(H)_i$ represents the priority that was calculated
using the statistics collected during the ith window
(and Equation 2), and r (0 < r ≤ 1) is a CLIC parameter.
The effect of Equation 3 is that the impact of statistics
gathered during the ith window decays exponentially
with each new window, at a rate that is controlled by r.
Setting r = 1 causes CLIC to base its priorities entirely 
on the statistics collected during the most recently com- 
pleted window. Lower values of r cause CLIC to give 
more weight to older statistics. For all of the experiments 
reported in this paper, we have set W = 10^6 and r = 1.
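
The window-boundary update of Equation 3 can be illustrated with a short sketch. The function name and the example priorities are hypothetical. The sketch also makes one assumption that the text does not spell out: a hint set that is missing from either the previous priorities or the current window's estimates is treated as having priority zero there.

```python
def smooth_priorities(previous, window_estimate, r):
    """Blend the current window's estimates with the previous priorities (Equation 3)."""
    if not 0.0 < r <= 1.0:
        raise ValueError("r must satisfy 0 < r <= 1")
    hint_sets = set(previous) | set(window_estimate)
    return {
        h: r * window_estimate.get(h, 0.0) + (1.0 - r) * previous.get(h, 0.0)
        for h in hint_sets
    }


previous = {"A": 0.004, "B": 0.001}  # priorities used during the last window
estimate = {"A": 0.002}              # Equation 2 applied to the newest window only

blended = smooth_priorities(previous, estimate, r=0.5)
latest_only = smooth_priorities(previous, estimate, r=1.0)
```

With r = 0.5 the older statistics still contribute half of each priority, whereas with r = 1 only the most recently completed window matters, so hint set B drops to zero.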


4 Cache Management 


In the previous section, we described how CLIC assigns 
a caching priority to each hint set H. In this section, we 
describe how the server uses these priorities to manage 
the contents of its cache. 


Figure 4 describes CLIC’s priority-based replacement 
policy. This policy evicts a lowest priority page from 
the cache if the newly requested page has higher prior- 
ity. The priority of a page is determined by the priority 
Pr(H) of the hint set H with which that page was last 
requested. Note that if a page that is cached after being 
requested with hint set H is subsequently requested with 
hint set H’, its priority changes from Pr(H) to Pr(H’). 
The most recent request for each cached page always de- 
termines its caching priority. 


The policy described in Figure 4 can be implemented 
to run in constant expected time. To do this, CLIC main- 
tains a heap-based priority queue of the hint sets. For 
each hint set H in the heap, all pages with H(p) = H are 
recorded in a doubly-linked list that is sorted by seq(p). 
This allows the victim page to be identified (Figure 4, 
lines 7-11) in constant time. CLIC also maintains a hash 
table of all cached pages so that it can tell which pages 
are cached (line 1) and find a cached page in its hint set 
list in constant expected time. Finally, CLIC implements 
the hint table as a hash table so that it can look up Pr(H) 
(line 12) in constant expected time. 


As described in Section 3.2, CLIC adjusts hint set pri- 
orities after every window of W requests. When this oc- 
curs, CLIC rebuilds its hint set priority queue based on 
the newly adjusted priorities. Hint set priorities do not 
change except at window boundaries. 
 1  if p is not cached then
 2      if the cache is not full then
 3          cache p
 4          set seq(p) = s
 5          set H(p) = H
 6      else
 7          let m be the minimum priority
 8              of all pages in the cache
 9          let v be the page with the
10              minimum sequence number seq(v)
11              among all pages with priority m
12          if Pr(H) > m then
13              evict v from the cache
14              add entry for v (with seq(v)
15                  and H(v)) to the outqueue
16              cache p
17              set seq(p) = s
18              set H(p) = H
19          else /* do not cache p */
20              add entry for p to the outqueue
21              set seq(p) = s
22              set H(p) = H
23  else /* p is already cached */
24      seq(p) = s
25      H(p) = H


Figure 4: Hint-Based Server Cache Replacement Policy 
This pseudo-code shows how the server handles a request for 
page p with hint set H and request sequence number s. 
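
For readers who prefer executable code, the policy of Figure 4 can be sketched in Python as shown below. This is a deliberately simple illustration and not the constant-time implementation described earlier: the victim is found with a linear scan instead of a heap of per-hint-set lists. The class name ClicCache, its parameters, and the example priorities are hypothetical, and the sketch assumes a capacity of at least one page.

```python
from collections import OrderedDict


class ClicCache:
    def __init__(self, capacity, priorities, n_outq=1024):
        self.capacity = capacity
        self.priorities = priorities   # hint set -> Pr(H), fixed within a window
        self.pages = {}                # cached page -> (seq, hint set)
        self.outqueue = OrderedDict()  # uncached page -> (seq, hint set)
        self.n_outq = n_outq

    def _pr(self, hint_set):
        return self.priorities.get(hint_set, 0.0)

    def _to_outqueue(self, page, seq, hint_set):
        if self.n_outq <= 0:
            return
        self.outqueue.pop(page, None)
        if len(self.outqueue) >= self.n_outq:
            self.outqueue.popitem(last=False)  # least-recently inserted entry
        self.outqueue[page] = (seq, hint_set)

    def request(self, page, hint_set, seq):
        """Handle one request; return True if the page is cached afterwards."""
        if page in self.pages:                  # Figure 4, lines 23-25
            self.pages[page] = (seq, hint_set)
            return True
        if len(self.pages) < self.capacity:     # lines 2-5
            self.outqueue.pop(page, None)
            self.pages[page] = (seq, hint_set)
            return True
        # Lines 7-11: lowest priority, ties broken by smallest sequence number.
        victim = min(self.pages,
                     key=lambda q: (self._pr(self.pages[q][1]), self.pages[q][0]))
        v_seq, v_hint = self.pages[victim]
        if self._pr(hint_set) > self._pr(v_hint):   # lines 12-18
            del self.pages[victim]
            self._to_outqueue(victim, v_seq, v_hint)
            self.outqueue.pop(page, None)
            self.pages[page] = (seq, hint_set)
            return True
        self._to_outqueue(page, seq, hint_set)      # lines 19-22
        return False


cache = ClicCache(capacity=2, priorities={"hot": 0.01, "warm": 0.005, "cold": 0.001})
trace = [("a", "cold"), ("b", "warm"), ("c", "hot"), ("d", "cold")]
admitted = [cache.request(p, h, s) for s, (p, h) in enumerate(trace, start=1)]
```

In this small example, page c displaces page a because its hint set has higher priority, while page d is refused and only recorded in the outqueue.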


5 Handling Large Numbers of Hint Sets 


As described in Section 3.1, CLIC’s hint table records 
statistical information about every hint set that the server 
has observed. Although the amount of statistical information
tracked per hint set is small, the number of distinct
hint sets from each client might be as large as the
product of the cardinalities of that client’s hint value domains.
In our traces, the number of distinct hint sets is
small. For other applications, however, the number of
hint sets could potentially be much larger. In this section, 
we propose a simple technique for restricting the number 
of hint sets that CLIC must consider, so that CLIC can 
continue to operate efficiently as the number of hint sets 
grows. 

All of the hint types in our workload traces exhibit fre- 
quency skew. That is, some values in the hint domain oc- 
cur much more frequently than others. As a result, some 
hint sets occur much more frequently than others. To re- 
duce the number of hints that CLIC must consider, we 
propose to exploit this skew by tracking statistics for the 
hint sets that occur most frequently in the request stream 
and ignoring those that do not. Ignoring infrequent hint 
sets may lead to errors. In particular, we may miss a hint 
set that would have had high caching priority. However, 
since any such missed hint set would occur infrequently, 
the impact of the error on the server’s caching performance
is likely to be small.

The problem with this approach is that we must deter- 
mine, on the fly, which hint sets occur frequently, with- 
out actually maintaining a counter for every hint set. 
Fortunately, this frequent item problem arises in a vari- 
ety of settings, and numerous methods have been pro- 
posed to solve it. We have chosen one of these methods: 
the so-called Space-Saving algorithm [14], which has recently
been shown to outperform other frequent item algorithms
[7]. Given a parameter k, this algorithm tracks
the frequency of k different hint sets, among which it
attempts to include as many of the actual k most frequent
hint sets as possible. It is an on-line algorithm
which scans the sequence of hint sets attached to the requests
arriving at the server. Although k different hint
sets are tracked at once, the specific hint sets that are be- 
ing tracked may vary over time, depending on the request 
sequence. 

After each request has been processed, the algorithm 
can report the k hint sets that it is currently tracking, as 
well as an estimate of the frequency (total number of oc- 
currences) of each hint set and an error indicator which 
bounds the error in the frequency estimate. By analyzing 
the frequency estimates and error indicators, it is possible
to determine which of the k currently-tracked hint
sets are guaranteed to be among the actual top-k most
frequent hint sets and which are not. However, for our 
purposes this is not necessary. 

We adapted the Space-Saving algorithm slightly so 
that it tracks the additional information we require for 
our analysis. Specifically: 


N(H): For each hint set H that is tracked by the Space-Saving
algorithm, we use the frequency estimate
produced by the algorithm, minus the estimation error
bound reported by the algorithm, as N(H).


N_r(H): We modified the Space-Saving algorithm to include
an additional counter for each hint set H that
is currently being tracked. This counter is initialized
to zero when the algorithm starts tracking H, and it
is incremented for each read re-reference involving
H that occurs while H is being tracked. We use the
value of this counter as N_r(H).


D(H): We track the expected re-reference distance for
all read re-references involving H that occur while
H is being tracked, i.e., those read re-references that
are included in N_r(H).


For all hint sets H that are not currently tracked by the
algorithm, we take N_r(H) to be zero, and hence Pr(H)
to be zero as well.

In general, N(H) will be an underestimate of the
true frequency of hint set H. Since N_r(H) is only incremented
while H is being tracked, it too will in general
underestimate the true frequency of read re-references
involving H. As a result of these underestimations,
f_hit(H), which is calculated as the ratio of N_r(H) to
N(H), may be inaccurate. However, because f_hit(H) is
a ratio, the two underestimations may
at least partially cancel one another, leading to a more accurate
f_hit(H). In addition, the higher the true frequency
of H, the more time H will spend being tracked and the
more accurate we expect our estimates to be.

To account for time-varying workloads, we restart the 
Space-Saving algorithm from scratch for every window 
of W requests. Specifically, at the end of each window 
we use the Space-Saving algorithm to estimate N(H),
N_r(H), and D(H) for each hint set H that is tracked
by the algorithm, as described above. These statistics
are used to calculate the priority estimate for the window
just completed (via Equation 2), which is then used in
Equation 3 to calculate the hint set’s caching priority Pr(H)
to be used during the next request window. Once the
priorities have been calculated, the Space-Saving algorithm’s
state is cleared in preparation for the next window.

The Space-Saving algorithm requires two counters for 
each tracked hint-set, and we added several additional 
counters for the sake of our analysis. Overall, the space 
required is proportional to k. Thus, this parameter can be 
used to limit the amount of space required to track hint 
set statistics. With each new request, the data structure 
used by the Space-Saving algorithm can be updated in 
constant time [14], and the statistics for the tracked hint 
sets can be reported, if necessary, in time proportional to 
k. 
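
The adapted Space-Saving bookkeeping can be sketched as follows. This is an illustrative simplification under our own assumptions: the class and method names are hypothetical, and the minimum-count entry is found with a linear scan over the k tracked hint sets rather than with the constant-time structure of [14]. The sketch assumes k is at least 1.

```python
class SpaceSavingHints:
    def __init__(self, k):
        self.k = k
        # hint set -> [frequency estimate, error bound, read re-references, distance sum]
        self.entries = {}

    def record_request(self, h):
        entry = self.entries.get(h)
        if entry is not None:
            entry[0] += 1
        elif len(self.entries) < self.k:
            self.entries[h] = [1, 0, 0, 0]
        else:
            # Replace the tracked hint set with the smallest frequency estimate.
            victim = min(self.entries, key=lambda x: self.entries[x][0])
            floor = self.entries.pop(victim)[0]
            # The newcomer inherits the victim's count as its possible overestimate;
            # its re-reference counters start from zero.
            self.entries[h] = [floor + 1, floor, 0, 0]

    def record_read_rereference(self, h, distance):
        entry = self.entries.get(h)
        if entry is not None:  # ignored unless h is currently tracked
            entry[2] += 1
            entry[3] += distance

    def statistics(self):
        """Return {hint set: (N(H), N_r(H), D(H))} for the tracked hint sets."""
        out = {}
        for h, (count, error, n_r, dist_sum) in self.entries.items():
            out[h] = (count - error, n_r, dist_sum / n_r if n_r else 0.0)
        return out


ss = SpaceSavingHints(k=2)
for h in ["A", "A", "B", "C", "A"]:
    ss.record_request(h)
ss.record_read_rereference("A", 4)
ss.record_read_rereference("A", 6)
ss.record_read_rereference("B", 3)  # B was displaced by C, so this is ignored
result = ss.statistics()
```

In the example, hint set C takes over B's slot with a frequency estimate of 2 and an error bound of 1, so its N(H) is reported conservatively as 1.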


6 Experimental Evaluation 


Objectives: We used trace-driven simulation to evaluate 
our proposed mechanisms. The goal of our experimental 
evaluation is to answer the following questions: 


1. Can CLIC identify good caching opportunities for 
storage server caches, and thereby improve the 
cache hit ratio compared to other caching policies?
(Section 6.1)


2. How effective are CLIC’s mechanisms for reducing
the number of hint sets that it must track? (Sections
6.2 and 6.3)


3. Can CLIC improve performance for multiple stor- 
age clients by prioritizing the caching opportunities 
of the different clients based on their observed ref- 
erence behavior? (Section 6.4) 


Simulator: To answer these questions, we implemented 
a simulation of the storage server cache. In addition to 
CLIC, the simulator implements the following caching 
policies for purposes of comparison:


OPT: This is an implementation of the well-known op- 
timal off-line MIN algorithm [4]. It replaces the 
cached page that will not be read for the longest 
time. This algorithm requires knowledge of the fu- 
ture so it cannot be used for cache replacement in 
practical systems, but its hit ratio is optimal so it 
serves as an upper bound on the performance of any 
caching algorithm. 


LRU: This algorithm replaces the least-recently used 
page in the cache. Since temporal locality is often 
poor in second-tier caches, we expect CLIC to per- 
form significantly better than LRU. 


ARC: ARC [13] is a hint-oblivious caching policy that 
considers both recency and frequency of use in 
making replacement decisions. 


TQ: TQ is a hint-aware algorithm that was proposed for 
use in second-tier caches [11]. Unlike the algo- 
rithms proposed here, it works only with one spe- 
cific type of hint that can be associated with write 
requests from database systems. We expect our pro- 
posed algorithms, which can automatically exploit 
any type of hint, to do at least as well as TQ when 
the write hints needed by TQ are present in the re- 
quest stream. 


The TQ algorithm has previously been compared to a 
number of other second-tier caching policies that are 
not considered here. These include MQ [22], a hint- 
oblivious policy, and write-hint-aware variations of both 
MQ and LRU. TQ was shown to be generally superior 
to those alternatives when the necessary write hints are 
present [11], so we use it as our representative of the state 
of the art in hint-aware second-tier caching policies. 

The simulator accepts a stream of I/O requests with associated
hint sets, as would be generated by one or more
storage clients. It simulates the caching behavior of one 
of the five supported cache replacement policies (CLIC, 
OPT, LRU, ARC and TQ) and computes the read hit ra- 
tio for the storage server cache. The read hit ratio is the 
number of read hits divided by the number of read re- 
quests. 
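
To make the metric concrete, the following sketch computes the read hit ratio for the LRU baseline on a toy request stream. It is not our simulator: the function name and the trace are hypothetical, and the sketch assumes that both reads and writes install the requested page in the cache.

```python
from collections import OrderedDict


def lru_read_hit_ratio(trace, capacity):
    """trace is an iterable of (page, is_read) pairs; returns read hits / read requests."""
    cache = OrderedDict()  # pages ordered from least to most recently used
    reads = hits = 0
    for page, is_read in trace:
        if is_read:
            reads += 1
            if page in cache:
                hits += 1
        if page in cache:
            cache.move_to_end(page)        # refresh recency
        elif capacity > 0:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least-recently used page
            cache[page] = True
    return hits / reads if reads else 0.0


toy_trace = [("a", False), ("b", True), ("a", True), ("c", True), ("b", True)]
ratio = lru_read_hit_ratio(toy_trace, capacity=2)
```

Only the read of page a hits in this toy trace. Writes are not counted in the denominator, although they do change the cache contents.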
Workloads: In this paper, we use DB2 Universal
Database (version 8.2) and the MySQL database system
(Community Edition, version 5.0.33) as our storage
system clients. DB2 is a widely-used commercial
relational database system for which we had access to
source code, and MySQL is a widely-used open source
relational database system. We instrumented DB2 and 
MySQL so that they would generate I/O hints and dump 
them into an I/O trace. The types of hints generated by 
these two systems are described in Figure 2. 

To generate our traces, we ran TPC-C and TPC-H 
workloads on DB2 and a TPC-H workload on MySQL. 

Trace      DBMS    Workload   DB Size    DBMS Buffer    Requests     Distinct    Distinct
Name                          (pages)    Size (pages)                Hint Sets   Pages
DB2_C60    DB2     TPC-C      600K       60K            37699091     164         930688
DB2_C300   DB2     TPC-C      600K       300K           31869377     154         1320882
DB2_C540   DB2     TPC-C      600K       540K           21863719     140         1807431
DB2_H80    DB2     TPC-H      800K       80K            635375701    134         732905
DB2_H400   DB2     TPC-H      800K       400K           65675204     129         732723
DB2_H720   DB2     TPC-H      800K       720K           3077872      128         732690
MY_H65     MySQL   TPC-H      328K       65K            36266735     21          167502
MY_H98     MySQL   TPC-H      328K       98K            16561346     21          167501

Figure 5: I/O Request Traces. The page sizes for the DB2 and MySQL databases were 4KB and 16KB, respectively. For the 
TPC-C workloads, the table shows the initial database size. The TPC-C database grows as the workload runs. 


TPC-C and TPC-H are well-known on-line transac- 
tion processing (TPC-C) and decision support (TPC-H) 
benchmarks. We ran TPC-C at scale factor 25. At this 
scale factor, the TPC-C database initially occupied ap- 
proximately 600,000 4KB blocks, or about 2.3 GB, in the 
storage system. The TPC-C workload inserts new items 
into the database, so the database grows during the TPC-C
run. For the TPC-H experiments, the database size was
approximately 3.2 GB for the DB2 runs, and 5 GB for the 
MySQL runs. The DB2 TPC-H workload consisted of 
the 22 TPC-H queries and the two refresh updates. The 
workload for MySQL was similar except that it did not 
include the refresh updates and we skipped one of the
22 queries (Q18) because of excessive run-time on our 
MySQL configuration. 

On each run, we controlled the size of the database 
system’s internal buffer cache. We collected traces using 
a variety of different buffer cache sizes for each DBMS. 
We expect the buffer cache size to be a significant pa- 
rameter because it affects the temporal locality in the 
I/O request stream that is seen by the underlying stor- 
age server. Figure 5 summarizes the I/O request traces 
that were used for the experiments reported here. 


6.1 Comparison to Other Caching Policies 


In our first experiment, we compare the cache read hit ra- 
tio of CLIC to that of other replacement policies that we 
consider (LRU, ARC, TQ, and OPT). We varied the size 
of the storage server buffer cache, and we present the 
read hit ratio as a function of the server’s buffer cache 
size for each workload. For these experiments, we set 
r = 1.0 and the size of CLIC’s outqueue (N_outq) to 5 entries
per page in the storage server’s cache. If the cache
holds C pages, this means that CLIC tracks the most recent
reference for 6C pages, since it tracks this information
for all cached pages, plus those in the outqueue.
For each tracked page, CLIC records a sequence num- 
ber and a hint set. If each of these is stored as a 4-byte 
integer, this represents a space overhead of roughly 1%. 
To account for this, we reduced the server cache size by
1% for CLIC only, so that the total space used by CLIC 
would be the same as that used by other policies. ARC 
also employs a structure similar to CLIC’s outqueue for 
tracking pages that are not in the cache. However, we 
did not reduce ARC’s cache size. As a result, ARC has a 
small space advantage in these experiments. 
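
The space overhead quoted above can be checked with a short calculation. The sketch below assumes 4KB pages (the DB2 page size given in Figure 5) and the two 4-byte fields per tracked page described in the text. The function name and its parameters are introduced here only for illustration.

```python
def clic_space_overhead(page_bytes, outqueue_entries_per_page=5,
                        fields_per_entry=2, bytes_per_field=4):
    """Fraction of cache space used by CLIC's per-page tracking metadata."""
    # One entry for the cached page itself plus its share of outqueue entries.
    tracked_per_cached_page = 1 + outqueue_entries_per_page
    # Each entry holds a sequence number and a hint set identifier.
    metadata_bytes = tracked_per_cached_page * fields_per_entry * bytes_per_field
    return metadata_bytes / page_bytes

overhead_4k = clic_space_overhead(4096)     # DB2 traces, 4KB pages
overhead_16k = clic_space_overhead(16384)   # MySQL traces, 16KB pages
print(f"{overhead_4k:.2%} {overhead_16k:.2%}")
```

With 6 tracked entries of 8 bytes each per cached page, the metadata comes to 48 bytes per 4096-byte page, or about 1.2%, consistent with the rough 1% figure. Under the same assumptions, the overhead for 16KB pages would be about a quarter of that.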


Figure 6 shows the results of this experiment for the 
DB2 TPC-C traces. All of the algorithms have similar 
performance for the DB2_C60 trace. That trace comes 
from the DB2 configuration with the smallest buffer 
cache, and there is a significant amount of temporal lo- 
cality in the trace that was not “absorbed” by DB2’s 
cache. This temporal locality can be exploited by the 
storage server cache. As a result, even LRU performs 
reasonably well. Both of the hint-based algorithms (TQ 
and CLIC) also do well. 


The performance of LRU is significantly worse on
the other two TPC-C traces, as there is very little temporal
locality. ARC performs better than LRU, as expected,
though substantially worse than both of the hint-aware
policies. On the DB2_C300 trace, CLIC, which learns how
to exploit the available hints, does about as well as TQ,
which implements a hard-coded response to one particular
hint type, and both policies’ performance
approaches that of OPT. CLIC outperforms TQ on the
DB2_C540 trace, though it is also further from OPT. The 
DB2_C540 trace comes from the DB2 configuration with 
the largest buffer cache, so it has the least temporal local- 
ity of all traces and therefore presents the most difficult 
cache replacement problem. 


Figures 7 and 8 show the results for the TPC-H traces 
from DB2 and for the MySQL TPC-H traces, respec- 
tively. Again, CLIC generally performs at least as well 
as the other replacement policies that we considered. In 
some cases, e.g., for the DB2_H400 trace, CLIC’s read 
hit ratio is more than twice the hit ratio of the best hint- 
oblivious alternative. In one case, for the DB2_H80 trace 
with a server cache size of 300K pages, both LRU and 
ARC outperformed both TQ and CLIC. We are uncertain
of the precise reason for this inversion. However, this is
a scenario in which there is a relatively large amount of
residual locality in the workload (because the DB2 buffer
cache is small) and in which the storage server cache may
be large enough to capture it.


Figure 6: Read Hit Ratio of Caching Policies for the DB2
TPC-C Workloads


6.2 Tracking Only Frequent Hint Sets 


In this experiment, we study the effect of tracking only 
the most frequently occurring hint sets using the top-k 
algorithm described in Section 5. In our experiment we 
vary k, the number of hint sets tracked by CLIC, and 
measure the server cache hit ratio. 

Figure 9 shows some of the results of this experiment. 

Figure 7: Read Hit Ratio of Caching Policies for the DB2
TPC-H Workloads


The top graph in Figure 9 shows the results for the DB2 
TPC-C traces, with a server cache size of 180K pages. 
We obtained similar results with the DB2 TPC-C traces 
for other server cache sizes. In all cases, tracking the 
20 most frequent hints (i.e., setting k = 20) was suffi- 
cient to achieve a read hit ratio close to what we could 
obtain by tracking all of the hints in the trace. In many 
cases, tracking fewer than 10 hints sufficed. The curve 
for the DB2_C540 trace illustrates that the Space-Saving
algorithm that we use to track frequent hint sets can
sometimes suffer from some instability, in the sense that
larger values of k may result in worse performance than
smaller k. This is because hint sets reported by the Space-Saving
algorithm when k = k_1 are not guaranteed to be
reported by the Space-Saving algorithm when k > k_1.
We only observed this problem occasionally, and only
for very small values of k.


Figure 8: Read Hit Ratio of Caching Policies for the
MySQL Workloads

The lower graph in Figure 9 shows the results for the 
DB2 TPC-H traces, with a server cache size of 180K 
pages. For all of the DB2 TPC-H traces and all of the 
cache sizes that we tested, k = 10 was sufficient to ob- 
tain performance close to that obtained by tracking all 
hint sets. For the MySQL TPC-H traces (not shown in 
Figure 9), which contained fewer distinct hint sets, k = 4 
was sufficient to obtain good performance. Overall, we 
found the top-k approach to be very effective at cutting 
down the number of hints to be considered by CLIC. 
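
For readers unfamiliar with the frequent-item tracking used in this section, the following is a minimal sketch of the counter-based Space-Saving procedure [14], written from its standard description rather than from CLIC's implementation. Stream items stand in for hint sets. The function name, the dictionary representation, and the example stream are assumptions made for illustration, and the linear scan for the minimum counter is chosen for brevity rather than efficiency.

```python
def space_saving(stream, k):
    """Track at most k items; return {item: (estimated_count, max_overestimate)}."""
    counters = {}  # item -> [estimated count, maximum possible overestimation]
    for item in stream:
        if item in counters:
            counters[item][0] += 1
        elif len(counters) < k:
            counters[item] = [1, 0]
        else:
            # Replace the item with the smallest count; the newcomer inherits
            # that count as its possible overestimation.
            victim = min(counters, key=lambda x: counters[x][0])
            min_count = counters.pop(victim)[0]
            counters[item] = [min_count + 1, min_count]
    return {item: tuple(entry) for item, entry in counters.items()}

# Hypothetical stream of 12 hint-set identifiers dominated by "x".
stream = ["x"] * 6 + ["a", "b", "c", "x", "d", "x"]
top = space_saving(stream, k=3)
print(top)
```

Because a newly admitted item inherits the evicted counter, estimated counts never undercount and always sum to the stream length. Which infrequent items hold the remaining counters depends on k and on arrival order.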


6.3 Increasing the Number of Hints 


In the previous experiment, we studied the effectiveness 
of the top-k approach at reducing the number of hints 
that must be tracked by CLIC. In this experiment, we 
consider a similar question, but from a different perspec- 
tive. Specifically, we consider a scenario in which CLIC 
is subjected to useless “noise” hints, in addition to the 
useful hints that it has exploited in our previous experi- 
ments. We limit the number of hint sets that CLIC is able 
to track and increase the level of “noise”. Our objective is
to determine whether the top-k approach is effective at
ignoring the noise, and focusing the limited space available
for hint-tracking on the most useful hints.


Figure 9: Effect of Top-K Hint Set Filtering on Read Hit
Ratio


In practice, we hope that storage clients will not gen- 
erate lots of useless hints. However, in general, clients 
will not be able to determine how useful their hints are 
to the server, and some hints generated by clients may be 
of little value. By deliberately introducing a controllable 
level of useless hints in this experiment, we hope to
test CLIC’s ability to tolerate them without losing track
of those hints that are useful.


For this experiment we used our DB2 TPC-C traces, 
each of which contains 5 real hint types, and added T
additional synthetic hint types. In other words, each request
will have 5 + T hints associated with it, the five
original hints plus T additional synthetic hints. Each injected
synthetic hint is chosen randomly from a domain
of D possible hint values. A particular value from the 
domain is selected using a Zipf distribution with skew 
parameter z = 1. When T > 1, each injected hint value 
is chosen independently of the other injected hints for 
the same record. Since the injected hints are chosen at 
random, we do not expect them to provide any information
that is useful for server cache management. This
injection procedure potentially increases the number of
distinct hint sets in a trace by a factor of D^T. For our experiments,
we chose D = 10, and we varied T, which
controls the amount of “noise”.
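
The injection procedure can be sketched as follows. This is an illustrative reconstruction from the description above, not the authors' tooling. The function names, the tuple representation of a hint set, the fixed seed, and the example hint values are all assumptions.

```python
import random

def zipf_weights(domain_size, skew=1.0):
    """Unnormalized Zipf weights: the value at rank i gets weight 1 / i**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, domain_size + 1)]

def inject_noise(hint_sets, num_noise_types, domain_size=10, skew=1.0, seed=0):
    """Append num_noise_types independently drawn synthetic hints to each hint set."""
    rng = random.Random(seed)
    weights = zipf_weights(domain_size, skew)
    values = range(domain_size)
    noisy = []
    for hints in hint_sets:
        # Each synthetic hint is drawn independently of the others.
        noise = rng.choices(values, weights=weights, k=num_noise_types)
        noisy.append(tuple(hints) + tuple(noise))
    return noisy

# One hypothetical 5-hint set repeated 1000 times, with T = 0..3 noise types.
original = [("table", "read", 3, 0, 1)] * 1000
for t in range(4):
    distinct = len(set(inject_noise(original, t)))
    print(t, distinct)  # bounded above by 10**t for a single original hint set
```

Because the draws are skewed, the number of distinct hint sets observed in a finite trace is typically below the D^T upper bound.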

Figure 10 shows the read hit ratios in a server cache of 
size 180K pages as a function of T. We fixed k = 100 
for the top-k algorithm, so the number of hints tracked
by CLIC remains fixed at 100 as the number of useless
hints increases. As T goes from 0 to 3, the total number
of distinct hint sets in each trace increases from just over
100 (the number of distinct hint sets in each TPC-C trace),
to about 1000 when T = 1, and to more than 50000 when
T = 3.

Ideally, the server cache read hit ratio would remain 
unchanged as the number of “noise” hints is increased. 
In practice, however, this is not the case. As shown in 
Figure 10, CLIC fares reasonably well for the DB2_C60 
trace, suffering mild degradation in performance for T >
2. However, for the other two traces, CLIC experienced 
more substantial degradation, particularly for T > 2. 
The cause of the degradation is that high-priority hint 
sets from the original trace get “diluted” by the additional 
noise hint types. For example, with D = 10 and T = 2, 
each original hint set is split into as many as D^T = 100
distinct hint sets because of the additional noise hints that 
appear with each request. Since CLIC has limited space 
for tracking hint sets, the dilution eventually overwhelms 
its ability to track and identify the useful hints. 

This experiment suggests that it may be necessary to 
tune or modify CLIC to ensure that it operates well in 
situations in which the storage clients provide too many 
low-value hints. One way to address this problem is to 
increase k as the number of hints increases, so that CLIC 
is not overwhelmed by the additional hints. Controlling 
this tradeoff of space versus accuracy is an interesting 
tuning problem for CLIC. An alternative approach is to 
add an additional mechanism to CLIC that would allow 
it to group similar hint sets together, and then track re- 
reference statistics for the groups rather than the individ- 
ual hint sets. We have explored one technique for doing 
this, based on decision trees. However, both the deci- 
sion tree technique and the tuning problem are beyond 
the scope of this paper, and we leave them as subjects for 
future work. 


6.4 Multiple Storage Clients 


One desirable feature of CLIC is that it should be capable 
of accommodating hints from multiple storage clients. 
The clients independently send their different hints to 
the storage server without any coordination among them- 
selves, and CLIC should be able to effectively prioritize 
the hints to get the best overall cache hit ratio. 

Figure 10: Effect of Number of Additional Hint Types
on Read Hit Ratios

To test this, we simulated a scenario in which multiple
instances of DB2 share a storage server. Each DB2 instance
manages its own separate database, and represents
a separate storage client. All of the databases are housed 
in the storage server, and the storage server’s cache must 
be shared among the pages of the different databases. 
To create this scenario, we create a multi-client trace for 
our simulator by interleaving requests from several DB2 
traces, each of which represents the requests from a sin- 
gle client. We interleave the requests in a round robin 
manner, one from each trace. We truncate all traces to 
the length of the shortest trace being interleaved to elim- 
inate bias towards longer traces. We treat the hint types 
in each trace as distinct, so the total number of distinct 
hint sets in the combined trace is the sum of the number 
of distinct hint sets in each individual trace. 
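
The construction of the multi-client trace can be sketched as follows. The request representation (a page number paired with a hint tuple) and the use of a client index to keep pages and hint types distinct are assumptions made for this example, not details taken from the simulator.

```python
def interleave_round_robin(traces):
    """Interleave client traces one request at a time, truncated to the shortest."""
    shortest = min(len(trace) for trace in traces)
    merged = []
    for i in range(shortest):
        for client_id, trace in enumerate(traces):
            page, hints = trace[i]
            # Tag pages and hints with the client index so that separate
            # databases and their hint types never collide.
            merged.append(((client_id, page), (client_id,) + tuple(hints)))
    return merged

# Two hypothetical client traces of unequal length.
trace_a = [(1, ("h1",)), (2, ("h2",)), (3, ("h1",))]
trace_b = [(1, ("h1",)), (9, ("h3",))]
merged = interleave_round_robin([trace_a, trace_b])
print(len(merged), len({hints for _, hints in merged}))
```

In this example, the merged trace has four requests, and its four distinct hint sets equal the sum of the two distinct hint sets in each truncated input trace.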


Figure 11 shows results for the trace generated by 
interleaving the DB2_C60, DB2_C300, and DB2_C540
traces. The server cache size is 180K pages, and CLIC
uses top-k filtering with k = 100. The figure shows the
read hit ratio for the requests from each individual trace 
that is part of the interleaved trace. The figure also shows 
the overall hit ratio for the entire interleaved trace. For 
comparison, the figure shows the hit ratios for the full- 
length (untruncated) traces when they use independent 
caches of size 60K pages each (i.e., the storage server 
cache is partitioned equally among the clients). The fig- 
ure shows a dramatic improvement in hit ratio for the 
DB2_C60 trace and also an improvement in the overall 
hit ratio as compared to equally partitioning the server 
cache among the traces. CLIC is able to identify that 
the DB2_C60 trace presents the best caching opportuni- 
ties (since it has the most temporal locality), and to fo- 
cus on caching pages from this trace. This illustrates that 
CLIC is able to accommodate hints from multiple storage 
clients and prioritize them so as to maximize the overall 
hit ratio.


Figure 11: Read Hit Ratio with Three Clients.
Read hit ratio is near zero for the DB2_C300 and DB2_C540
traces in the 180K page shared cache, so bars are not visible.

Note that it is possible to consider other objectives 
when managing the shared server cache. For exam- 
ple, we may want to ensure fairness among clients or to 
achieve certain quality of service levels for some clients. 
This may be accomplished by statically or dynamically 
partitioning the cache space among the clients. In CLIC, 
the objective is simply to maximize the overall cache 
hit ratio without considering quality of service targets or 
fairness among clients. This objective results in the best
utilization of the available cache space. Our experiment 
illustrates that CLIC is able to achieve this objective, al- 
though the benefits of the server cache may go dispro- 
portionately to some clients at the expense of others. 


7 Related Work 


Many replacement policies have been proposed to im- 
prove on LRU, including MQ [22], ARC [13], CAR [3], 
and 2Q [10]. These policies use a combination of re- 
cency of use and frequency of use to make replacement 
decisions. They can be used to manage a cache at any 
level of a cache hierarchy, though some, like MQ, were 
explicitly developed for use in second-tier caches, for 
which there is little temporal locality in the workload. 
ACME [1] is a mechanism that can be used to automat- 
ically and adaptively choose a good policy from among 
a pool of candidate policies, based on the recent perfor- 
mance of the candidates. 

There are also caching policies that have been pro- 
posed explicitly for second (or lower) tier caches in a 
cache hierarchy. Chen et al [6] have classified these 
as either hierarchy-aware or aggressively collaborative. 
Hierarchy-aware methods specifically exploit the knowl- 
edge that they are running in the second tier, but they are 
transparent to the first tier. Some such approaches, like 
X-RAY [2], work by examining the contents of requests
submitted by a client application in the first tier. By as- 
suming a particular type of client and exploiting knowl- 
edge of its behavior, X-RAY can extract client-specific 
semantic information from I/O requests. This informa- 
tion can then be used to guide caching decisions at the 
server. X-RAY has been proposed for file system clients 
[2] and DBMS clients [17]. 

Aggressively collaborative approaches require 
changes to the first tier. Examples include PROMOTE 
[8] and DEMOTE [19], both of which seek to maintain 
exclusivity among caches, and hint-based techniques, 
including CLIC. Although all aggressively collaborative 
techniques require changes to the first tier, they vary 
considerably in the intrusiveness of the changes that 
are required. For example, ULC [9] gives complete 
responsibility for management of the second tier cache 
to the first tier. PROMOTE [8] prescribes a replacement 
policy that must be used by all tiers, including the first 
tier. This may be undesirable if the first tier cache is 
managed by a database system or other application 
which prefers an application-specific policy for cache 
management. Among the aggressively collaborative 
techniques, hint-based approaches like CLIC are ar- 
guably the least intrusive and least costly. Hints are 
small and can be piggybacked onto I/O requests. More 
importantly, hint-based techniques do not require any 
changes to the policies used to manage the first tier 
caches. 

Several hint-based techniques have been proposed, in- 
cluding importance hints [6] and write hints [11], which 
have already been described. In their work on informed 
prefetching and caching, Patterson et al [16] distin- 
guished hints that disclose from hints that advise, and 
advocated the former. Most subsequent hint-based tech- 
niques, including CLIC, use hints that disclose. In- 
formed prefetching and caching rely on hints that dis- 
close sequential access to entire files or to portions of 
files. Karma [21] relies on application hints to group 
pages into “ranges”, and to associate an expected access 
pattern with each range. Unlike CLIC, all of these techniques
are designed to exploit specific types of
hints. As was discussed in Section 1, this makes them 
difficult to generalize and combine. 

A previous study [6] suggested that aggressively col- 
laborative approaches provided little benefit beyond that 
of hierarchy-aware approaches and thus, that the loss of 
transparency implied by collaborative approaches was 
not worthwhile. However, that study only considered one 
ad hoc hint-based technique. Li et al [11] found that the 
hint-based TQ algorithm could provide substantial per- 
formance improvements in comparison to hint-oblivious 
approaches (LRU and MQ) as well as simple hint-aware 
extensions of those approaches. 



There has also been work on the problem of sharing 
a cache among multiple competing client applications 
[5, 12, 18, 20]. Often, the goal of these techniques is 
to achieve specific quality-of-service objectives for the 
client applications, and the method used is to somehow 
partition the shared cache. This work is largely orthogo- 
nal to CLIC, in the sense that CLIC can be used, like any 
other replacement algorithm, to manage the cache contents
in each partition. CLIC can also be used to directly
control a shared cache, as in Section 6.4, but it does not 
include any mechanism for enforcing quality-of-service 
requirements or fairness requirements among the com- 
peting clients. 

The problem of identifying frequently-occurring items 
in a data stream occurs in many situations. Metwally 
et al [14] classify solutions to the frequent-item prob- 
lem as counter-based techniques or sketch-based tech- 
niques. The former maintain counters for certain indi- 
vidual items, while the latter collect information about 
aggregations of items. For CLIC, we have chosen to use 
the Space-Saving algorithm [14] as it is both effective 
and simple to implement. A recent study [7] found the 
Space-Saving algorithm to be one of the best overall per- 
formers among frequent-item algorithms. 


8 Conclusion 


We have presented CLIC, a technique for managing a 
storage server cache based on hints from storage client 
applications. CLIC provides a general, adaptive mech- 
anism for incorporating application-provided hints into 
cache management. We used trace-driven simulation to 
evaluate CLIC, and found that it was effective at learning
to exploit hints. In our tests, CLIC learned to perform
as well as or better than TQ, an ad hoc hint-based
technique. In many scenarios, CLIC also performed sub- 
stantially better than hint-oblivious techniques such as 
LRU and ARC. Our results also show that CLIC, unlike 
TQ and other ad hoc techniques, can accommodate hints 
from multiple client applications. 

A potential drawback of CLIC is the space overhead
that is required for learning which hints are valuable. We
considered a simple technique for limiting this over- 
head, which involves identifying frequently-occurring 
hints and tracking statistics only for those hints. In many 
cases, we found that it was possible to significantly re- 
duce the number of hints that CLIC had to track with 
only minor degradation in performance. However, al- 
though tracking only frequent hints is a good way to re- 
duce overhead, the overhead is not eliminated and the 
space required for good performance may increase with 
the number of hint types that CLIC encounters. As part 
of our future work, we are using decision trees to generalize
hint sets by grouping related hint sets together into
a common class. We expect that this approach, together
with the frequency-based approach, can enable CLIC to 
accommodate a large number of hint types. 


Abstract


Clustered applications in storage area networks (SANs), 
widely adopted in enterprise datacenters, have tradition- 
ally relied on distributed locking protocols to coordi- 
nate concurrent access to shared storage devices. We 
examine the semantics of traditional lock services for 
SAN environments and ask whether they are sufficient 
to guarantee data safety at the application level. We ar- 
gue that a traditional lock service design that enforces 
strict mutual exclusion via a globally-consistent view of 
locking state is neither sufficient nor strictly necessary 
to ensure application-level correctness in the presence 
of asynchrony and failures. We also argue that in many 
cases, strongly-consistent locking imposes an additional 
and unnecessary constraint on application availability. 
Armed with these observations, we develop a set of novel 
concurrency control and recovery protocols for clustered 
SAN applications that achieve safety and liveness in the 
face of arbitrary asynchrony, crash failures, and network 
partitions. Finally, we present and evaluate Minuet, a
new synchronization primitive based on these protocols 
that can serve as a foundational building block for safe 
and highly-available SAN applications. 


1 Introduction 


In recent years, storage area networks (SANs) have been 
gaining widespread adoption in enterprise datacenters 
[19] and are proving effective in supporting a range of 
applications across a broad spectrum of industries. Ac- 
cording to a recent survey of IT professionals across a 
range of corporations, government agencies, and uni- 
versities, the overwhelming majority (80%) have de- 
ployed a storage area network in their organizations and 
26% of the respondents report having deployed five or 
more SANs [14]. Some of the common applications 
include online transaction processing in finance and e- 
commerce, digital media production, business data ana- 
lytics, and high-performance scientific computing. 

A SAN architecture is a particularly attractive choice 
for parallel clustered applications that demand high- 
speed concurrent access to a scalable storage backend. 
Such applications commonly rely on a clustered middle- 
ware service to provide a higher-level storage abstraction 
such as a filesystem (GFS [35], OCFS [8], PanFS [10], 
GPFS [37]) or a relational database (Oracle RAC [9]) on 
top of raw disk blocks. 


One of the primary design challenges for clustered 
SAN applications and middleware is ensuring safe and 
efficient coordination of access to application state and 
metadata that resides on shared storage. The traditional 
approach to concurrency control in shared-disk clusters 
involves the use of a synchronization module called a 
distributed lock manager (DLM). Typically, DLM services
aim to provide the guarantee of strict mutual exclusion,
ensuring that no two processes in the system can
simultaneously hold conflicting locks. In abstract terms, 
providing such guarantees requires enforcing a globally- 
consistent view of lock acquisition state and one could 
argue that a traditional DLM design views such consis- 
tency as an end-in-itself rather than a means to achieving 
application-level correctness. 


In this paper, we take a close look at the semantics of 
SAN lock services and ask whether the assurances of full 
mutual exclusion and strongly-consistent locking are, in 
fact, a prerequisite for correct application behavior. Our 
main finding is that the standard semantics of mutual ex- 
clusion provided by a DLM are neither strictly necessary 
nor sufficient to guarantee safe coordination in the pres- 
ence of node failures and asynchrony. In particular, pro- 
cessing and queuing delays in SAN switches and host 
bus adapters (HBAs) expose applications to out-of-order 
delivery of I/O requests from presumed faulty processes 
which, in certain scenarios, can incur catastrophic viola- 
tions of safety and cause permanent data loss. 


We propose and evaluate a new technique for disk ac- 
cess coordination in SAN environments. Our approach 
augments target storage devices with a tiny application- 
independent functional component, called a guard, and a 
small amount of state, which enable them to reject incon- 
sistent I/O requests and provide a property called session 
isolation. 


These extensions enable a novel optimistic approach 
to concurrency control in SANs and can also make ex- 
isting lock-based protocols safe in the face of arbitrar- 
ily delayed message delivery, drifting clocks, crash pro- 
cess failures, and network partitions. The session isola- 
tion property in turn provides a foundational primitive 
for implementing more complex and useful coordina- 
tion semantics, such as serializable transactions, and we 
demonstrate one such protocol. 


We then describe the implementation of Minuet, a
software library that provides a novel synchronization
primitive for SAN applications based on the protocols
we present. Minuet assumes the presence of a guard at
the target storage devices and provides applications with 
locking and distributed transaction facilities, while guar- 
anteeing liveness and data safety in the face of arbi- 
trary asynchrony, node failures, and network partitions. 
Our evaluation shows that applications built atop Minuet 
compare favorably to those that rely on a conventional 
strongly-consistent DLM, offering comparable or better 
performance and improved availability. 

Unlike existing services for fault-tolerant distributed 
coordination such as Chubby [20] and Zookeeper [15], 
Minuet requires its lock managers to maintain only 
loosely-consistent replicas of locking state and thus per- 
mits applications to make progress with less than a ma- 
jority of lock manager replicas. To demonstrate the prac- 
tical feasibility of our approach, we implemented two 
sample applications — a distributed chunkmap and a B+ 
tree — on top of Minuet and evaluated them in a clustered 
environment supported by an iSCSI-based SAN. 

The benefits of optimistic concurrency control and the 
associated tradeoffs have been extensively explored in 
the database literature and are well understood. In par- 
ticular, techniques such as callback locking, optimistic 
2-phase locking, and adaptive callback locking [18, 21, 
24,42] have been proposed to enable safe coordination 
and efficient caching in client-server databases. It is im- 
portant to note, however, that these approaches are not 
directly applicable to SANs because they assume the 
existence of a central lock server, typically co-located 
with the data block storage server. This assumption 
does not hold in a SAN environment, where the storage 
"servers" are application-agnostic disk arrays that pos- 
sess no knowledge of locking state or node liveness sta- 
tus. Hence, a conservative DLM service that enforces 
strict mutual exclusion has traditionally been viewed as 
the only practical method of coordinating concurrent ac- 
cess to shared state for SAN applications. 

Our main insight is that a single nearly trivial exten- 
sion to the internal logic of a SAN storage device suffices 
to address the data safety problems associated with tra- 
ditional DLMs and enables a very different approach to 
protocol layering for storage access coordination. Cru- 
cially, we achieve this without introducing application- 
level logic into storage devices and without forfeiting the 
generality and simplicity of the traditional block-level in- 
terface to SAN-attached devices. 

The technical feasibility of device-based synchroniza- 
tion and its practical advantages have been demonstrated 
by several earlier proposals [12,29]. Our study builds on 
this earlier work and while prior efforts have primarily 
focused on moving the functionality of a traditional clus- 
ter lock manager into the storage device, Minuet aims to 
provide a more general and useful synchronization primitive
that supports a wider range of concurrency control
mechanisms. In addition to supporting traditional
conservative locking, our approach enables an optimistic 
method of concurrency control that can improve perfor- 
mance for certain application workloads. Further, Min- 
uet allows existing locking protocols to remain safe in the 
presence of arbitrarily-delayed message delivery, node 
failures, and network partitions. 

The rest of this paper is organized as follows. In Section 2,
we provide the relevant background on SANs and
some representative examples of data safety problems.
In Section 3, we present our main contribution: the design
of Minuet, a novel safe and highly available synchronization
mechanism for SAN applications. Section 4
describes our prototype implementation and two sample 
parallel applications. We evaluate our system in Sec- 
tion 5 and discuss practical aspects of our approach in 
Section 6. Finally, we discuss related work in Section 7 
and conclude in Section 8. 


2 Background 


2.1 Storage area networks 


Storage area networks (SANs) are popular in enterprise 
datacenters and are commonly adopted to support the 
storage needs of data-intensive clustered applications. In 
the SAN (or shared-disk) model, persistent storage de- 
vices, typically disk drive arrays or specialized hardware 
appliances, are attached to a dedicated storage network 
and appear to members of the application cluster as local 
disks. Most SANs utilize a combination of SCSI and a 
low-level transport protocol such as TCP/IP or FCP (Fi- 
bre Channel Protocol) for communication between the 
application nodes and the target storage devices. 

SANs aim to provide fully decentralized access to
shared application state on disk: in principle, any
SAN-attached client node can access any piece of data
without routing its requests through a dedicated server.
All requests on a particular piece of data are still
centrally serialized in this model, but the crucial
distinction from the traditional server-attached storage
paradigm is that the point of serialization is a hardware
disk controller that exposes an application-independent
I/O interface on raw disk blocks and is oblivious to
application semantics and data layout considerations.

Broadly, the SAN paradigm is advantageous from the 
standpoint of availability because it offers better redun- 
dancy and decouples node failures from loss of persistent 
state. Incoming application requests can be routed to any 
available node in the cluster and, in the event of a node 
failure, subsequent requests can be redirected to another 
processor with minimal interruption of service. 

One of the primary design challenges for clustered 
SAN applications and middleware is ensuring safe and 
efficient coordination of access to shared state on disk.
Commonly, a software service called a distributed
lock manager (DLM) is employed to provide such coordination.
A typical lock service such as OpenDLM [7]
operates on shared resources, abstract application-level 
entities that require access coordination, and attempts to 
provide the guarantee of mutual exclusion - no two pro- 
cesses may simultaneously hold conflicting locks on the 
same resource. 


2.2 Safety and liveness problems in SANs 


In theory, DLM-based mutual exclusion offers sufficient 
mechanism to ensure safe access to shared state. In prac- 
tice, however, guaranteeing safe serialization of disk re- 
quests tends to be more difficult than the above discus- 
sion might suggest due to the effects of node failures and 
asynchrony: nodes can fail by stopping and the process- 
ing and communication delays are not bounded. The fol- 
lowing examples illustrate the nature of the problem. 
Scenario 1: Consider a data structure S spanning 10
blocks on a shared disk D and two clients, C1 and C2,
that are accessing the data structure concurrently. C1
is updating blocks [3-7] of S under the protection of
an exclusive lock, while C2 wants to read S in its entirety
(i.e., blocks [0-9]) and is waiting for a shared
lock. Suppose C1 crashes after sending its WRITE request
to D but before hearing the response. The lock
manager correctly detects the failure, reclaims the exclusive
lock, and grants it to C2 in shared mode. Next, C2
proceeds to reading S and, assuming that a single disk
request can carry up to 5 blocks of data, issues two requests:
R1 = (READ [0-4]) and R2 = (READ [5-9]).
Suppose C1’s delayed WRITE request on blocks [3-7]
reaches the disk after R1 but before R2, in which case
only the latter would reflect the effects of C1’s update.
Hence, although individual I/O requests are processed by
D as atomic units, their inconsistent interleaving would
cause C2 to observe and act upon a partial update from
C1, which can be viewed as a violation of data safety.

As an alternative to heartbeat failure detection, a lease-based
mechanism [26] can be used to coordinate clients’
accesses in the above example, but precisely the same
problematic scenario would arise when clocks are not
synchronized. When C1 crashes and its lease expires, the
lease manager could grant it to C2 prior to the arrival of
the last WRITE from C1 to the storage target. Since the
target does not coordinate with the lease manager, it fails
to establish the fact that an incoming request from C1 is
inconsistent with the current lease ownership state.
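
The interleaving in Scenario 1 can be reproduced with a toy model. The following Python sketch is illustrative only: the in-memory `disk` list and the `read` and `write` helpers are assumptions that stand in for a SAN target applying each request atomically in arrival order.

```python
# Toy model of Scenario 1: every request is atomic, yet C2's combined view is torn.
disk = ["old"] * 10            # data structure S occupies blocks 0..9


def write(first, last, value):
    """Atomically overwrite blocks first..last (inclusive)."""
    for block in range(first, last + 1):
        disk[block] = value


def read(first, last):
    """Atomically read blocks first..last (inclusive)."""
    return disk[first:last + 1]


# Arrival order at the disk: R1, then C1's delayed WRITE, then R2.
r1 = read(0, 4)                # C2's first request sees only old data
write(3, 7, "new")             # C1's in-flight update to blocks 3..7 lands late
r2 = read(5, 9)                # C2's second request sees part of the update
view = r1 + r2                 # what C2 treats as a consistent copy of S

print(view)
```

In the resulting view, blocks 3 and 4 hold the old contents while blocks 5 through 7 hold the new ones, which is exactly the partial update described above.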
Scenario 2: Clustered applications and middleware 
services commonly need to enforce transactional seman- 
tics on updates to application state and metadata. In 
a shared-disk clustered environment, distributed transactions
have traditionally been supported by two-phase
locking in conjunction with a distributed write-ahead
logging (WAL) protocol. In the abstract, the system 
maintains a snapshot of application state along with a set 
of per-client logs (also on shared disks) that record Redo 
and/or Undo information for every transaction along with 
its commit status. During failure recovery, the system 
must examine the suspected client’s log and restore con- 
sistency by rolling back all uncommitted updates and re- 
playing all updates associated with committed transac- 
tions that may not have been flushed to the snapshot prior 
to the failure. An essential underlying assumption here is 
that once log recovery is initiated, no additional WRITE 
requests from the suspected process will reach the snap- 
shot. A violation of this assumption could result in the 
corruption of logs and application data. 

Ensuring data safety in a shared-disk environment has 
traditionally required a set of partial synchrony assump- 
tions to allow reliable heartbeat-driven failure detection 
and/or leases. For example, lease-based mechanisms 
typically expect bounded clock drift rates and message 
delivery delays to ensure the absence of in-flight I/O re- 
quests upon lease termination. However, these assump- 
tions are probabilistic at best and since application data 
integrity is predicated on the validity of these assump- 
tions, failure timeouts must be tuned to a very conser- 
vative value to account for worst-case delays in switch 
queues and client-side buffering. Such (necessarily) pes- 
simistic timeouts may have a profoundly negative impact 
on failure recovery times - one of the common criticisms 
of SAN-oriented applications [16]. 

Another serious limitation exhibited by today’s SAN 
applications is liveness. The DLM (or lease manager) 
represents an additional point of failure and while vari- 
ous fault tolerance techniques can be applied to improve 
its availability, the very nature of the semantics enforced 
by the DLM places a fundamental constraint on the over- 
all system availability. For instance, multiple lock man- 
ager replicas can be deployed in a cluster, but mutual 
exclusion can be guaranteed only if clients’ requests are 
presented to them in a consistent order, which necessi- 
tates consensus mechanisms such as Paxos [31]. Alter- 
natively, a single lock manager instance can be elected 
dynamically [27] from a group of candidates and in 
this case, ensuring mutual exclusion necessitates global 
agreement on the lock manager’s identity. In both cases, 
reaching agreement fundamentally requires access to an 
active primary component - typically a majority of nodes. 
As a result, a large-scale node failure or a network par- 
tition that renders the primary component unavailable or 
unreachable may bring about a system-wide outage and 
complete loss of service. 

To summarize, today’s SAN applications and middle- 
ware face significant limitations along the dimensions 
of safety and liveness. At present, several hardware-assisted
techniques, such as out-of-band power management
(STOMITH) [3], SAN fabric fencing [1], and
SCSI-3 PR [11] can be employed to mitigate some of 
these issues. These mechanisms help reduce the likeli- 
hood of data corruption under common failure scenar- 
ios, but do not provide the desired assurances of safety 
and liveness in the general case and, as we would ar- 
gue, do not address the underlying problem. We observe 
that the underlying problem may be a case of capabil- 
ity mismatch between "intelligent" application processes 
that possess full knowledge of application’s data struc- 
tures, their disk layout, and consistency semantics on the 
one hand and relatively "dumb" storage devices on the 
other. The safety and liveness problems illustrated above 
can be attributed to a disk controller’s inability to identify 
and appropriately react to the various application-level 
events such as lock release, failure suspicion, and failure 
recovery action. 


3 Minuet Design 


At a high level, our approach reexamines the correctness 
criteria that a cluster DLM service must provide to appli- 
cations. Traditionally, DLMs tend to treat shared appli- 
cation resources as purely abstract entities and enforce 
the mutual exclusion property: no two clients may si- 
multaneously hold conflicting locks on the same shared 
resource. We note, however, that the mutual exclusion 
property as stated above is provably unattainable in an 
asynchronous system that is subject to even a single crash 
failure - a consequence of the impossibility of consen- 
sus [23] in such an environment. Furthermore, as we 
explain in the previous section, a hypothetical lock ser- 
vice that does offer such guarantees would not by itself 
suffice to guarantee data safety in such a setting due to 
the possibility of out-of-order I/O request delivery. 
Rather than restricting access to critical code sections, 
our approach views the access coordination problem in 
terms of I/O request ordering guarantees that the storage 
system must provide to application processes. We refer 
to this alternate notion of correctness as session isolation. 
We define this correctness property in formal terms be- 
low and then present a protocol that achieves session iso- 
lation with the help of guard logic. Finally, we demon- 
strate how distributed multi-resource transactions can be 
supported using session isolation as a building block. 


3.1 Session isolation 


Throughout this paper, we will use the term resource 
to denote the basic logical unit of concurrency control. 
Each resource R is identified by a unique and persistent
application-level identifier (denoted R.resID) and has
some physical representation on a SAN-attached storage
device, which we call its owner (R.owner). More concretely,
a resource may represent a filesystem block, a
database table, or an individual tuple in a table. An application
process operates on R by issuing READ/WRITE
commands to R.owner, as well as by acquiring and releasing
locks on R.resID. We begin by defining the notion
of a session to a shared resource and describing the
session isolation criterion.

Figure 1: Concurrent request streams to a shared resource X from
two client processes, C1 and C2. In this example, C1 first performs
two READ operations on X under the protection of a Shared lock,
then upgrades to Excl and issues two WRITEs. Lastly, C1 downgrades
its lock to Shared and performs two more READs. Client C2
acquires a Shared lock on X and submits a READ request, followed
by an upgrade to Excl and two WRITE requests.


Definition 1. If a client process C requests a Shared lock
on R and the request is granted by the lock service, we 
say that C establishes a shared session to R. An existing 
shared session is terminated when C releases the Shared 
lock (i.e., downgrades to None). Analogously, by acquir- 
ing an Excl lock, a client establishes an exclusive ses- 
sion to R that can subsequently be terminated by down- 
grading to Shared or None. 

We define Sessions(T,C,R) to be the set of all sessions
to R from C active at time T, which is determined solely
by the sequence of C’s prior upgrade and downgrade requests
to the lock service. Sessions(T,C,R) may contain
a shared or an exclusive session to R, or both, or none. 

We say that a shared session conflicts with every ex- 
clusive session to the same resource R and an exclusive 
session conflicts with every other session to R. 


Definition 2. If a client process C issues at time T a
disk request r that operates on shared resource R, we say
that r belongs to session S if $S \in \mathit{Sessions}(T,C,R)$. For
a given session S, we additionally define Requests(S) to 
be the set of all disk requests that belong to S. 


Definition 3. A given global execution history satisfies
session isolation with respect to R if the sequence
of disk request messages $M = (r_1, r_2, \ldots)$ observed and
processed in this history by R.owner satisfies:
$\forall r_i, r_j \in M$ such that $\{r_i, r_j\} \subseteq \mathit{Requests}(S)$ for some S:
$\nexists r_k \in M$ such that $i < k < j$ and $r_k \in \mathit{Requests}(S^*)$ for a
session $S^*$ from another client that conflicts with S.





Informally, the above condition requires R.owner to 
observe the prefixes of all sessions to R in strictly serial
order, ensuring that no two requests in a client’s session
are interleaved by a conflicting request from another
client. To illustrate this definition, consider a pair of concurrent
request sequences shown in Figure 1. In this scenario,
the following two orderings of request observations
by the owner of shared resource X would satisfy
session isolation:


E1 = (R1.1, R1.2, W1.1, W1.2, R1.3, R1.4, R2.1, W2.1, W2.2)
E2 = (R1.1, R1.2, W1.1, R2.1, W2.1, W2.2)


However, an execution history that causes the owner to
observe (R1.1, R2.1, R1.2, W1.1, W2.1) does not obey session
isolation because it permits R2.1 and W2.1, two
shared-session requests from C2, to be interleaved by
W1.1, an exclusive-session request from C1.
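
Definition 3 can be checked mechanically on small histories. The Python sketch below is one possible reading of the definition, not code from the paper: the `Session` and `Request` types, the `conflicts` helper, and the encoding of the Figure 1 requests are all assumptions introduced for illustration. Each request is tagged with the set of sessions active at its client when it was issued, so a WRITE sent after an upgrade belongs to both the shared and the exclusive session.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Session:
    client: str
    mode: str                      # "Shared" or "Excl"


@dataclass(frozen=True)
class Request:
    name: str
    sessions: frozenset            # sessions the request belongs to


def conflicts(a, b):
    """Sessions from different clients conflict if at least one is exclusive."""
    return a.client != b.client and "Excl" in (a.mode, b.mode)


def satisfies_session_isolation(history):
    """Return True if no two requests of one session are interleaved by a
    request that belongs to a conflicting session of another client."""
    for i, j in combinations(range(len(history)), 2):
        for s in history[i].sessions & history[j].sessions:
            for k in range(i + 1, j):
                if any(conflicts(s, t) for t in history[k].sessions):
                    return False
    return True


def req(name, *sessions):
    return Request(name, frozenset(sessions))


S1s, S1x = Session("C1", "Shared"), Session("C1", "Excl")
S2s, S2x = Session("C2", "Shared"), Session("C2", "Excl")

R11, R12, R13, R14 = (req(n, S1s) for n in ("R1.1", "R1.2", "R1.3", "R1.4"))
W11, W12 = req("W1.1", S1s, S1x), req("W1.2", S1s, S1x)
R21 = req("R2.1", S2s)
W21, W22 = req("W2.1", S2s, S2x), req("W2.2", S2s, S2x)

E1 = [R11, R12, W11, W12, R13, R14, R21, W21, W22]
E2 = [R11, R12, W11, R21, W21, W22]
BAD = [R11, R21, R12, W11, W21]

print(satisfies_session_isolation(E1),
      satisfies_session_isolation(E2),
      satisfies_session_isolation(BAD))
```

The checker is cubic in the history length, which is acceptable for hand-written examples but is not meant as an efficient verifier.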

Note that session isolation is more permissive than 
strict mutual exclusion and in particular, permits ex- 
ecution histories in which two clients simultaneously 
hold conflicting locks on the same shared resource. 
At the same time, one could argue that these seman- 
tics meaningfully capture the essence of shared-disk 
locking, by which we mean that the request order- 
ing guarantees provided by session isolation are pre- 
cisely those that applications developers have come to 
expect from a traditional DLM. To see this, observe 
that in the previous example, a conventional lock service
offering full mutual exclusion would cause X to
observe E1 by granting clients’ requests in the order
(C1(Shared), C1(Excl), C2(Shared), C2(Excl)). Likewise,
E2 corresponds to a possible failure scenario in
which C1 crashes after acquiring its locks, causing the
DLM to reclaim them and grant ownership to C2.


3.2 Guard


Our core approach is inspired by earlier work on bridg- 
ing the intelligence gap between applications and block 
storage devices [17,25], as well as earlier proposals 
for device-based synchronization [12,29]. We aug- 
ment SAN-attached disks with a small application- 
independent component, which we call a guard, that en- 
forces the session isolation invariant on the stream of in- 
coming I/O commands. We associate a session identifier 
(SID) with every client session to a shared resource and 
modify the storage protocol stack on the initiators to annotate
all outgoing disk commands with the current SID
for the respective resource. Below, we refer to this additional
state in the command header as session annotation.

A session annotation for a disk command operating 
on R has two components: a session verifier and a ses- 
sion update, denoted by R.verifySID and R.updateSID, 
respectively. For commands that belong to an existing 
session, the verifier enables the target to confirm session 
validity prior to accepting the command and updateSID 
is used by the initiator to signal the start of a new session. 


For each shared resource R, its owner device maintains 
a local session identifier (denoted R.ownerSID) on per- 
sistent storage. Upon receipt of an I/O command from an 
initiator, the owner invokes the guard, which evaluates 
the command’s session annotation against R.ownerSID 
and determines whether session isolation would be pre- 
served by accepting the command. Functionally, the 
guard operation is a form of compare-and-set and we de- 
scribe this operation in detail in Section 3.3. 

If an incoming I/O request fails verification, the target 
drops the request from its input queue and notifies the ini- 
tiator via a special status code EBADSESSION. From an 
application developer’s point of view, session rejection 
appears as a failed disk request along with an exception 
notification from the lock service indicating that a lock 
on the respective resource is no longer valid. 

The guard situated at the target devices addresses the
safety problems due to delayed messages and inconsistent
failure observations that plague asynchronous distributed
environments. Enforcing safety at the target
device also permits us to simplify the core functionality of the
DLM module. In our scheme, the primary purpose of the
lock service is ensuring an efficient assignment of ses- 
sion identifiers to clients that minimizes the frequency of 
command rejection for a given application workload. 

Decoupling correctness from performance in this man- 
ner enables substantial flexibility in the choice of mech- 
anism used to control the assignment of session identi- 
fiers. At one extreme is a purely optimistic technique, 
whereby every client selects its SIDs via an independent
local decision without attempting to coordinate with the 
rest of the cluster and this might be an entirely reason- 
able strategy for applications and workloads character- 
ized by a consistently low rate of data contention. A tra- 
ditional DLM service that serializes all session requests 
at a central lock server can be viewed as a design point 
at the other extreme. Minuet aims to position itself in the 
continuum between these extremes and allow application 
developers to trade off lock service availability, synchro- 
nization overhead, and I/O performance. 


3.3 Enforcing session isolation


Minuet uses a simple timestamp-based mechanism to 
guarantee the session isolation invariant. A client’s ses- 
sion to a given resource R is identified by a value pair 
(Ts, Tx) specifying a shared and an exclusive timestamp,
respectively. These timestamps are globally unique - no 
two sessions from distinct clients are identified using the 
same pair of values and no client is assigned the same 
value pair twice. To ensure global uniqueness, we use the 
following timestamp format: (T.incNum.clntID), where 
clntID uniquely identifies the client process and incNum 
is the client’s incarnation number - a monotonic counter 
ensuring uniqueness across crashes. 

Figure 2: Protocol messages and per-resource state at application
clients, lock managers, and shared storage devices.


Client-side state: For each shared resource R, a
client C maintains a pair of session identifiers for
its shared and exclusive sessions to R, denoted by
R.sharedSID and R.exclSID, respectively. Additionally,
R.curSType identifies the current session type, one
of {None, Shared, Excl}, and R.contSType holds the
client’s session continuation type. The latter value is
used by the target device to verify (prior to executing
a request from C) that its existing session has not been
broken by a conflicting request from another client. Finally,
every client C maintains an estimate of the largest
shared and exclusive timestamp values previously assigned
to a session identifier for any client, which we denote
by R.maxTs and R.maxTx. Initially, R.sharedSID =
R.exclSID = NIL, R.curSType = R.contSType = None,
and R.maxTs = R.maxTx = 0. The steps and states of the
basic locking and storage access protocols are illustrated
in Figure 2.


Acquiring locks: To acquire/upgrade a lock on resource
R, a client C proposes a unique session timestamp pair
(proposedTs, proposedTx) to the lock manager. To acquire
a Shared lock on R, C sets proposedTx ← R.maxTx
and sets proposedTs to some unique timestamp greater
than R.maxTs. The client then sends an UpgradeLock
request to the lock manager, specifying the desired mode
(Shared) and the proposed timestamp pair. The lock
manager accepts and enqueues this request if no request
with a larger proposedTx value has been accepted. Otherwise,
the manager denies the request and responds with
UpgradeDenied, which includes the largest timestamp
values observed by the manager. In the latter case, the
client updates its local estimates (R.maxTs, R.maxTx) and
submits a new proposal. After accepting and enqueuing
C’s request, the lock manager eventually grants it and
responds with UpgradeGranted. The client then sets
R.curSType ← Shared and initializes the shared session
identifier: R.sharedSID ← (proposedTs, proposedTx).

Setting up a session annotation during I/O request submission:
  if (R.curSType = Shared) /* Shared session is active */
    R.updateSID ← R.sharedSID;
    R.verifySID.Ts ← NIL; R.verifySID.Tx ← R.sharedSID.Tx;
  else /* Exclusive session is active */
    R.updateSID ← R.exclSID;
    if (R.contSType = Shared)
      R.verifySID.Ts ← NIL; R.verifySID.Tx ← R.sharedSID.Tx;
    else
      R.verifySID ← R.exclSID;
Guard logic at the target device:
  Use the resource identifier (R.resID) to look up R.ownerSID;
  if (R.verifySID.Tx < R.ownerSID.Tx) REJECT;
  if (R.verifySID.Ts ≠ NIL)
    if (R.verifySID.Ts < R.ownerSID.Ts) REJECT;
  if (REJECTED)
    Respond to the initiator with (EBADSESSION, R.ownerSID);
  else
    R.ownerSID.Ts ← Max(R.ownerSID.Ts, R.updateSID.Ts);
    R.ownerSID.Tx ← Max(R.ownerSID.Tx, R.updateSID.Tx);
    Execute the command and respond to the initiator;

Figure 3: Disk request submission and guard logic pseudocode.

To upgrade a lock from Shared to Excl, the client
sends UpgradeLock to the lock manager after setting
proposedTs ← R.maxTs and proposedTx to some
unique timestamp greater than R.maxTx. Upon receiving
UpgradeGranted from the lock manager, the
client sets R.curSType ← Excl and R.exclSID ←
(proposedTs, proposedTx). Upgrading from None to
Excl is functionally equivalent to acquiring a Shared
lock and then upgrading to Excl, but as an optimization,
these operations can be combined into a single request to
the lock manager.
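
The proposal exchange for Shared and Excl locks can be sketched in Python as follows. This is a simplified illustration under several assumptions: timestamps are modeled as tuples (T, incNum, clntID) compared lexicographically, the `LockManager` and `Client` classes and their method names are hypothetical, the manager grants immediately instead of queueing requests behind conflicting holders, and the staleness rule applied to Excl proposals is an assumed extension, since the text states the rule only for Shared.

```python
ZERO = (0, 0, 0)    # stands in for the initial timestamp value 0


class LockManager:
    """Tracks only the largest accepted timestamps; no queue, no lock table."""

    def __init__(self):
        self.max_ts = ZERO
        self.max_tx = ZERO

    def upgrade_lock(self, mode, proposed_ts, proposed_tx):
        # Shared rule from the text: deny if a larger proposedTx was accepted.
        stale = proposed_tx < self.max_tx
        if mode == "Excl":
            # Assumption: an Excl proposal must also carry a current Ts.
            stale = stale or proposed_ts < self.max_ts
        if stale:
            return ("UpgradeDenied", self.max_ts, self.max_tx)
        self.max_ts = max(self.max_ts, proposed_ts)
        self.max_tx = max(self.max_tx, proposed_tx)
        return ("UpgradeGranted", proposed_ts, proposed_tx)


class Client:
    def __init__(self, clnt_id, inc_num=0):
        self.clnt_id, self.inc_num = clnt_id, inc_num
        self.max_ts = ZERO          # local estimate R.maxTs
        self.max_tx = ZERO          # local estimate R.maxTx

    def _fresh(self, after):
        """A timestamp greater than `after`, made unique by (incNum, clntID)."""
        return (after[0] + 1, self.inc_num, self.clnt_id)

    def acquire(self, manager, mode):
        """Propose until granted; returns the session identifier (Ts, Tx)."""
        while True:
            if mode == "Shared":
                proposal = (self._fresh(self.max_ts), self.max_tx)
            else:
                proposal = (self.max_ts, self._fresh(self.max_tx))
            reply = manager.upgrade_lock(mode, *proposal)
            # Both reply kinds carry timestamps that refresh the estimates.
            self.max_ts = max(self.max_ts, reply[1])
            self.max_tx = max(self.max_tx, reply[2])
            if reply[0] == "UpgradeGranted":
                return proposal


manager = LockManager()
c1, c2 = Client(1), Client(2)
shared_sid = c1.acquire(manager, "Shared")
excl_sid = c2.acquire(manager, "Excl")      # first proposal is denied, second granted
print(shared_sid, excl_sid)
```

Because a denial returns the manager's largest timestamps, the retry loop in this sketch terminates after at most one denial when no other client is proposing concurrently.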


Accessing shared storage: After establishing a session
to R by acquiring a corresponding lock, a client can proceed
to issuing disk requests that operate on the content
of R. Each outgoing request is augmented with
a session annotation that enables the target device to
verify proper ordering of requests and enforce session
isolation. The annotation carries a tuple of the form
(R.resID, R.verifySID, R.updateSID) and is initialized
as shown in Figure 3.

Upon receipt of a disk request from a client, the 
owner device invokes the guard logic, which evaluates 
the session annotation as specified in Figure 3. In the 
event of rejection, the owner immediately discards the 
command and sends an EBADSESSION response to 
the client, together with a response annotation carrying 
(R.ownerSID). Otherwise, the owner executes (or en- 
queues) the command and updates its local session iden- 
tifier as shown in the figure. 
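
The annotation setup and guard check of Figure 3 translate almost line for line into Python. The sketch below is illustrative rather than the prototype's code: the `SID` and `Guard` classes, the `annotate` function, the tuple encoding of timestamps, and the use of `None` for NIL are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Timestamp = Tuple[int, int, int]     # (T, incNum, clntID), compared lexicographically
ZERO: Timestamp = (0, 0, 0)


@dataclass
class SID:
    ts: Optional[Timestamp]          # shared timestamp Ts (None models NIL)
    tx: Optional[Timestamp]          # exclusive timestamp Tx


def annotate(cur_type, cont_type, shared_sid, excl_sid):
    """Client side of Figure 3: returns (verifySID, updateSID)."""
    if cur_type == "Shared":
        return SID(None, shared_sid.tx), shared_sid
    if cont_type == "Shared":        # first request after a Shared-to-Excl upgrade
        return SID(None, shared_sid.tx), excl_sid
    return excl_sid, excl_sid


class Guard:
    """Target side of Figure 3: one ownerSID per resource identifier."""

    def __init__(self):
        self.owner = {}

    def check(self, res_id, verify, update):
        owner = self.owner.setdefault(res_id, SID(ZERO, ZERO))
        rejected = verify.tx < owner.tx
        if verify.ts is not None and verify.ts < owner.ts:
            rejected = True
        if rejected:
            return ("EBADSESSION", SID(owner.ts, owner.tx))
        owner.ts = max(owner.ts, update.ts)
        owner.tx = max(owner.tx, update.tx)
        return ("SUCCESS", None)     # the command itself would execute here


guard = Guard()
c1_shared = SID((1, 0, 1), ZERO)             # C1's shared session
c2_excl = SID((1, 0, 1), (1, 0, 2))          # C2's later exclusive session

first = guard.check("R", *annotate("Shared", "None", c1_shared, None))
second = guard.check("R", *annotate("Excl", "None", None, c2_excl))
third = guard.check("R", *annotate("Shared", "Shared", c1_shared, None))
print(first[0], second[0], third[0])
```

In this run C1's first shared request and C2's exclusive request are accepted, after which C1's next shared request is rejected because its verifier carries an exclusive timestamp older than the owner's.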

Upon receipt of an EBADSESSION status code, the
initiator examines the response annotation and notifies
the application process that its lock and session on R
are no longer valid. The condition R.verifySID.Ts <
R.ownerSID.Ts indicates interruption of an exclusive
session, in which case the client downgrades its lock to
Shared, sets (R.curSType, R.contSType) ← Shared, and
sets R.exclSID ← NIL. A Shared lock is further downgraded
to None if R.verifySID.Tx < R.ownerSID.Tx
(since in this case, a conflicting exclusive-session request
has been accepted). In this situation, the
client sets (R.curSType, R.contSType) ← None and
R.sharedSID ← NIL. In both cases, the maximum timestamp
estimates R.maxTs and R.maxTx are updated to reflect
the most recent timestamps observed by the owner.

Upon receiving a SUCCESS status code, the client sets
R.contSType ← R.curSType and updates the shared session
identifier to reflect the most recent value in the annotation:
R.sharedSID ← R.updateSID. (This step is necessary
to ensure that a shared session remains valid after
a Shared → Excl upgrade or a downgrade to Shared.)
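
The client's reaction to the two status codes can be summarized in a short Python sketch. The `ClientState` class and the handler names are hypothetical, session identifiers are modeled as (Ts, Tx) tuples with `None` for NIL, and clearing the exclusive identifier when the lock drops to None is an assumption that the text leaves implicit.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Timestamp = Tuple[int, int, int]
ZERO: Timestamp = (0, 0, 0)


@dataclass
class ClientState:
    cur: str = "None"                     # R.curSType
    cont: str = "None"                    # R.contSType
    shared_sid: Optional[tuple] = None    # R.sharedSID as (Ts, Tx)
    excl_sid: Optional[tuple] = None      # R.exclSID as (Ts, Tx)
    max_ts: Timestamp = ZERO              # R.maxTs
    max_tx: Timestamp = ZERO              # R.maxTx


def on_bad_session(st, verify, owner):
    """Apply the downgrade rules after an EBADSESSION response."""
    v_ts, v_tx = verify
    o_ts, o_tx = owner
    if v_ts is not None and v_ts < o_ts:  # exclusive session was interrupted
        st.cur = st.cont = "Shared"
        st.excl_sid = None
    if v_tx < o_tx:                       # a conflicting exclusive request won
        st.cur = st.cont = "None"
        st.shared_sid = None
        st.excl_sid = None                # assumption: no Excl session survives
    st.max_ts = max(st.max_ts, o_ts)
    st.max_tx = max(st.max_tx, o_tx)


def on_success(st, update_sid):
    """Apply the bookkeeping after a SUCCESS response."""
    st.cont = st.cur
    st.shared_sid = update_sid


state = ClientState(cur="Excl", cont="Excl",
                    shared_sid=((1, 0, 1), ZERO),
                    excl_sid=((1, 0, 1), (1, 0, 1)))
on_bad_session(state, verify=((1, 0, 1), (1, 0, 1)),
               owner=((2, 0, 2), (1, 0, 1)))
print(state.cur, state.excl_sid, state.max_ts)
```

The example models an exclusive session broken by a newer shared session from another client: the lock falls back to Shared and the timestamp estimate R.maxTs advances to the owner's value.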

Downgrading locks: To downgrade an existing
lock from Excl to Shared, the client sends a DowngradeLock
request to the lock manager and resets
the exclusive-session state: R.exclSID ← NIL,
(R.curSType, R.contSType) ← Shared. Similarly, to
downgrade from Shared to None, the client notifies
the lock manager and sets R.sharedSID ← NIL,
(R.curSType, R.contSType) ← None. Upon receipt of a
DowngradeLock request, the manager updates the ownership
state for R and, if possible, grants the lock to the
next waiter in the queue.

Correctness: The locking protocol and the guard described
above guarantee session isolation and a formal
correctness argument can be found in [22]. Informally,
consider two clients C1 and C2 that compete for shared
and exclusive access to R, respectively, and suppose
that a shared-session request from C1 gets accepted with
R.updateSID = $(T_s^1, T_x^1)$ in its annotation. Observe that
due to global uniqueness of session proposals, the owner
of R would subsequently accept an exclusive-session request
from C2 with verifier R.verifySID = $(T_s^2, T_x^2)$ only
if $T_x^2$ is strictly greater than $T_x^1$. In this case, subsequent
shared-session requests from C1 would fail verification,
causing C1 to observe EBADSESSION and downgrade
its lock. Thus, session isolation would be preserved in
this example via a forced termination of C1’s session.
A similar argument demonstrates that no two exclusive-session
commands can be interleaved by a conflicting
command from another client.


3.4 Supporting distributed transactions 
3.4.1 Overview and design requirements 


Transactions are widely regarded as a useful program- 
ming primitive and traditionally, SAN-oriented applica- 
tions implement transactional semantics using two-phase 
locking for isolation and a write-ahead logging (WAL) 
facility (sometimes referred to as journaling) for atomic- 
ity and durability. To support transactions, Minuet relies 
on these well-understood and widely-used mechanisms, 
while extending them with the use of the guard to address
the safety problems outlined in Section 2.2. Since
the primary focus of this paper is feasibility of safe and
highly-available applications in SANs rather than perfor- 
mance, we provide only a subset of features typically 
found in a state-of-the-art transaction service such as
D-ARIES [38]. Below, we present a design that implements
Redo-only logging to support the "no force no steal"
buffer policy. Currently, our design permits only one
active transaction per process at a time: after starting a
transaction, a client must commit or abort before initiating
the next transaction. Finally, we assume unbounded
log space for each client. These restrictions allow us to 
focus the discussion on the novel aspects of our approach 
and we believe that additional optimizations, such as sup- 
port for Undo logging, can be easily retrofitted onto our 
scheme if necessary. The following set of requirements 
motivates our design: 

(1) Avoid introducing assumptions of synchrony re- 
quired by conventional transaction schemes for SAN 
environments. We rely on the guard at target devices 
to provide session isolation and protect the state on disk 
from the effects of arbitrarily-delayed I/O commands op- 
erating on the application data and the log. 

(2) Eliminate reliance on strongly-consistent lock- 
ing. Rather than requiring clients to coordinate concur- 
rent activity via a strongly-consistent DLM, the guard at 
storage devices enables a limited form of isolation and 
permits us to relax the degree of consistency required 
from the lock service. Prior to committing a transaction, 
a client process in Minuet issues an extra disk request, 
which verifies the validity of all locks acquired at the start 
of the transaction. This mechanism allows us to identify 
and resolve cases of conflicting access due to inconsis- 
tent locking state at commit time and can be viewed as a 
variant of optimistic concurrency control - a well-known 
technique from the DBMS literature [30]. 

(3) Avoid enforcing a globally-consistent view of 
process liveness. Rather than relying on a group mem- 
bership service to detect client failures and initiate log 
recovery proactively in response to perceived failures, 
our design explores a lazy approach to transaction recovery
that postpones the recovery action until the affected
data is accessed. This enables Minuet to operate without 
global agreement on group membership. 


3.4.2 Basic transaction protocol


Minuet stores transaction Redo information in a set of 
per-client logs on shared disks. The physical location 
of a client’s log can be computed from its client identifier
(clntID). These logs appear to Minuet’s transaction
module as regular lockable resources that can be read 
from and written to, while the guard is assumed to en- 
force session isolation in the event of concurrent access 
from multiple clients. 

To support transactions, we extend the basic session 
isolation machinery described in Section 3.3 with an additional
piece of state called a commit session identifier
(CSID) of the form (clntID, xactID). We extend the format
of a session annotation to include two commit session
identifiers, denoted verifyCSID and updateCSID,
and both are set to NIL unless specified otherwise.
For each shared resource R, the owner device main- 
tains a local commit session identifier (R.ownerCSID) 
as well as R.ownerSID. Conceptually, the value of 
R.ownerCSID at a particular point in time identifies the 
most recent transaction that may have updated R and 
committed without flushing its changes to the disk image
of R. If R.ownerCSID ≠ NIL, the current state of
R on disk may be missing updates from a committed 
transaction and thus cannot be assumed valid. In this 
case, R.ownerCSID.clntID identifies the client process 
responsible for the latest transaction on R and it is used 
to locate the corresponding log for recovery purposes. 

Upon receiving a disk request, the guard examines
the annotation and rejects the request if
R.verifyCSID.clntID ≠ R.ownerCSID.clntID or if
R.verifyCSID.xactID < R.ownerCSID.xactID. A request
is accepted only if its verifySID and verifyCSID
both pass verification and upon completing the request,
the owner device updates its local commit session identifier
by setting R.ownerCSID ← R.updateCSID. If verification
fails, the owner responds with EBADSESSION
and attaches the tuple (R.ownerSID, R.ownerCSID) in a
response annotation.
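
The commit-session check can be sketched as a small addition to the guard. In the Python sketch below the function names and the dictionary holding per-resource state are hypothetical, a CSID is modeled as a (clntID, xactID) tuple with `None` for NIL, and the treatment of NIL (a NIL verifier matches only a NIL owner value) is an assumption, since the text does not spell out how NIL fields compare. The verifySID check from Section 3.3 is omitted here.

```python
def csid_passes(verify_csid, owner_csid):
    """True if verifyCSID is consistent with the owner's ownerCSID."""
    if verify_csid is None or owner_csid is None:
        # Assumption: NIL matches only NIL.
        return verify_csid is None and owner_csid is None
    v_clnt, v_xact = verify_csid
    o_clnt, o_xact = owner_csid
    return v_clnt == o_clnt and v_xact >= o_xact


def guard_csid(owner_state, res_id, verify_csid, update_csid):
    """Verify the CSID, then install updateCSID on acceptance."""
    owner_csid = owner_state.get(res_id)
    if not csid_passes(verify_csid, owner_csid):
        return ("EBADSESSION", owner_csid)
    owner_state[res_id] = update_csid
    return ("SUCCESS", None)


state = {}
# A request that marks R as updated by transaction 7 of client "C1".
r1 = guard_csid(state, "R", None, ("C1", 7))
# An ordinary request with default (NIL) annotations is now rejected.
r2 = guard_csid(state, "R", None, None)
# A request that names the pending transaction passes and clears the mark.
r3 = guard_csid(state, "R", ("C1", 7), None)
print(r1, r2, r3)
```

Under this reading, a non-NIL R.ownerCSID blocks ordinary requests until some client presents a verifier naming the pending transaction, which matches the role of R.ownerCSID as a marker for updates not yet flushed to the disk image of R.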

In Minuet, transactions proceed in five stages: Begin,
Read, Update, Prepare, and Commit and we illustrate
them using high-level pseudocode in [22]. During
one-time client initialization, Minuet’s transaction service
locks the local client’s log in Excl mode. To begin
a new transaction T, the client selects a new transaction
identifier (curXactID) via a monotonic local counter
and appends a BeginXact record to its log. Next, in the
Read phase of a transaction, the application process acquires
a Shared lock on every resource in T.readSet and
reads the corresponding data from shared disks into local
memory buffers. In the Update phase that follows, the
client acquires Excl locks on the elements of T.writeSet,
applies the desired set of updates locally, and communicates
a description of updates to Minuet’s transaction
service, which appends the corresponding set of Update
records to the log. Each such record describes an atomic
mutation on some resource in T.writeSet and essentially
stores the parameters of a single disk WRITE command.

The Prepare phase serves a dual purpose: to verify the
validity of the client’s sessions (and hence, the accuracy
of cached data) and to lock the elements of the write set
in preparation for committing. For each resource in
T.readSet ∪ T.writeSet, the client sends a special
PREPARE request to its owner. Minuet implements PREPARE
requests as zero-length READs, whose sole purpose is to
transport an annotation and invoke the guard. PREPARE
requests for elements of T.writeSet carry
verifyCSID = NIL and updateCSID = (C, curXactID) in their
annotations, where C is the client’s identifier. If all
PREPARE requests return SUCCESS, the transaction enters
the final Commit phase, in which a CommitXact record is
force-appended to client C’s log.
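The logging and Prepare steps can be sketched as below. This is only an illustration under stated assumptions: the function `run_transaction`, the tuple layout of the log records, and the `send_prepare` callback are hypothetical, lock acquisition and the Read phase are omitted, and None stands for NIL.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

CSID = Optional[Tuple[int, int]]  # (clntID, xactID); None stands for NIL


def run_transaction(
    client_id: int,
    xact_id: int,
    read_set: Iterable[str],
    writes: Dict[str, bytes],
    send_prepare: Callable[[str, CSID, CSID], str],
    log: List[tuple],
) -> bool:
    """Log and prepare one transaction; return True if it committed."""
    # Begin: record the new transaction identifier.
    log.append(("BeginXact", xact_id))
    # Update: one log record per mutation in the write set.
    for resource, data in writes.items():
        log.append(("Update", xact_id, resource, data))
    # Prepare: one zero-length READ per resource, carrying the annotation.
    for resource in sorted(set(read_set) | set(writes)):
        update_csid = (client_id, xact_id) if resource in writes else None
        if send_prepare(resource, None, update_csid) != "SUCCESS":
            return False  # caller aborts; exception handling is out of scope
    # Commit: append the commit record (a real client would force it to disk).
    log.append(("CommitXact", xact_id))
    return True
```

Only write-set elements carry a non-NIL updateCSID, so a successful Prepare marks exactly those resources as possibly dirty at their owner devices.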

The protocol outlined above ensures transaction iso- 
lation, identifying cases of conflicting access during the 
Prepare phase. Recall, however, that under the session 
isolation semantics, any I/O command, including oper- 
ations on the log, may fail with EBADSESSION due to 
conflicting access from other clients. This gives rise to 
several exception cases at various stages of transaction 
execution. For example, a client C may receive an error 
while forcing CommitXact to disk due to loss of session 
to the log. This can happen only if another process has 
initiated log recovery on C and hence, the active
transaction must be aborted. Other failure cases and the
corresponding recovery logic are described in the report [22].
Syncing updates to disk: After committing a transaction,
a client C flushes its locally-buffered updates to R
simply by issuing the corresponding sequence of WRITE
commands to its owner device. Each such command specifies
in its annotation
{R.verifyCSID, R.updateCSID} = (C, syncXactID), where
syncXactID denotes C’s most recent committed transaction
that modified R. After flushing all committed updates, C
issues an additional zero-length WRITE request, which
specifies R.verifyCSID = (C, syncXactID) and
R.updateCSID = NIL in the annotation. This request causes
the device to reset R.ownerCSID to NIL, effectively
marking the disk image of R as "clean". Lastly, C appends
to its log an UpdateSynced record of the form
(R, syncXactID).
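The flush sequence can be written out as a list of annotated commands. The sketch below assumes a hypothetical dictionary representation of a WRITE command and uses None for NIL; only the annotation values follow the text.

```python
from typing import List, Tuple


def sync_commands(client_id: int, sync_xact_id: int, resource: str,
                  buffered_writes: List[Tuple[int, bytes]]):
    """Return (commands, log_record) for flushing committed updates to R."""
    csid = (client_id, sync_xact_id)
    # Data-carrying WRITEs: verifyCSID and updateCSID are both (C, syncXactID).
    commands = [
        {"op": "WRITE", "resource": resource, "offset": offset, "data": data,
         "verifyCSID": csid, "updateCSID": csid}
        for offset, data in buffered_writes
    ]
    # Final zero-length WRITE resets ownerCSID to NIL (marks R clean).
    commands.append(
        {"op": "WRITE", "resource": resource, "offset": 0, "data": b"",
         "verifyCSID": csid, "updateCSID": None}
    )
    return commands, ("UpdateSynced", resource, sync_xact_id)
```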

Lazy transaction recovery: A client C can initiate
transaction recovery when its disk command on some
resource R fails with EBADSESSION and a non-NIL value
ownerCSID = (Cr, xactID) is specified in the response
annotation. This response indicates that the disk image
of R may be missing updates from a transaction committed
earlier by another client Cr. If C suspects that Cr has
failed, it invokes a local recovery procedure that tries
to repair the disk image of R. First, C acquires exclusive
locks on R and Cr.Log and reads the log from disk. Next, C
searches the log for the most recent transaction that has
successfully flushed its updates to R, from which it
determines the list of subsequent committed updates that
may be missing from the disk image. The client then
proceeds to repair the state of R on disk by reapplying
these updates; all WRITE requests sent to the owner during
this phase specify
{R.verifyCSID, R.updateCSID} = R.ownerCSID in the
annotation. Finally, after reapplying all missing updates,
C completes recovery by issuing a zero-length WRITE
annotated with R.verifyCSID = R.ownerCSID and
R.updateCSID = NIL. A more detailed discussion of
transaction recovery in Minuet can be found in [22].
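The log search at the heart of recovery might look as follows. The record layout is an assumption for illustration: ("Update", xactID, R, data), ("CommitXact", xactID), and ("UpdateSynced", R, xactID), with transaction identifiers increasing monotonically as described earlier.

```python
def missing_updates(log, resource):
    """Committed Update records for `resource` newer than its last sync."""
    committed = {rec[1] for rec in log if rec[0] == "CommitXact"}
    synced = [rec[2] for rec in log
              if rec[0] == "UpdateSynced" and rec[1] == resource]
    # Most recent transaction whose updates were flushed to the resource.
    last_synced = max(synced) if synced else None
    return [
        rec for rec in log
        if rec[0] == "Update" and rec[2] == resource
        and rec[1] in committed
        and (last_synced is None or rec[1] > last_synced)
    ]
```

The returned records are in log order, which is the order in which a recovering client would reapply them; updates from transactions without a CommitXact record are skipped.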


3.5 Lock manager replication 


Some lock services seek to achieve fault tolerance by 
replicating lock managers. Since Minuet does not need 
to provide assurances of mutual exclusion, it relies on a 
simpler and more available replication scheme that per- 
mits clients to retain progress in the face of extensive 
node and connectivity failures. A lock can be acquired 
as long as at least one manager instance is reachable. In 
an extreme case, that instance can be the local Minuet 
client itself, which would simply grant its own proposals 
without coordinating with other processes. 

To support manager replication, we extend the basic
locking protocol presented in Section 3.3 as follows:
When acquiring or upgrading a lock, a client selects a
subset of managers, which we call its voter set, and sends
an UpgradeLock request to all members of this set. The
lock is considered granted once UpgradeGranted votes are
collected from all members. If any of the voters respond
with UpgradeDenied due to an outdated timestamp, the
client downgrades the lock on all members that have
responded with UpgradeGranted, updates its maxT_s and
maxT_x values, and resubmits the upgrade request with a
new timestamp proposal. As a performance optimization, we
allow UpgradeLock requests to specify an implicit
downgrade for an earlier timestamp.
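The retry loop over a voter set can be sketched with a toy manager. This is a deliberate simplification and not Minuet's protocol: the `Manager` class tracks a single timestamp instead of the per-mode timestamps of Section 3.3, ignores lock modes and blocked requests, and all names are hypothetical.

```python
class Manager:
    """Toy manager replica: remembers only the largest timestamp seen."""

    def __init__(self, max_seen=0):
        self.max_seen = max_seen
        self.holder = None

    def upgrade_lock(self, client, timestamp):
        if timestamp <= self.max_seen:
            return ("UpgradeDenied", self.max_seen)
        self.max_seen = timestamp
        self.holder = client
        return ("UpgradeGranted", timestamp)

    def downgrade_lock(self, client):
        if self.holder == client:
            self.holder = None


def acquire(client, voter_set, max_t, max_attempts=8):
    """Return the granted timestamp, or None if every attempt was denied."""
    for _ in range(max_attempts):
        proposal = max_t + 1
        replies = [(m, m.upgrade_lock(client, proposal)) for m in voter_set]
        denied = [value for _, (status, value) in replies
                  if status == "UpgradeDenied"]
        if not denied:
            return proposal  # votes collected from all members
        # Undo partial grants, refresh the estimate, and retry.
        for manager, (status, _value) in replies:
            if status == "UpgradeGranted":
                manager.downgrade_lock(client)
        max_t = max([proposal] + denied)
    return None
```

With a single-member voter set the loop succeeds on the first attempt whenever the local estimate is current, which mirrors the extreme case in which a client grants its own proposals.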


4 Implementation 


We have implemented a proof-of-concept prototype of 
Minuet based on the design presented in the preceding 
section. The prototype has been implemented on the 
Linux platform using C/C++ and consists of a client-side 
library, a lock manager process, an iSCSI protocol stack 
extension, and two sample parallel applications. 


4.1 Core Minuet modules 


Client-side library (5,440 LoC): The client-side 
component is implemented as a statically-linked library 
and provides an event-driven interface to Minuet’s core 
services, which include locking, remote disk I/O, and 
transaction execution. When requesting a lock, a client 
can optionally specify the desired size of the voter set, 
which enables application developers to tune the de- 
gree of locking consistency, enabling a choice between 
optimism and strict coordination. A small voter set 
works well for low-contention resources; it helps keep 
the lock message overhead low and permits clients to 
make progress in a partitioned network. Conversely, a
large voter set requires connectivity to more manager
replicas, but reduces the rate of I/O rejection under high 
contention. All outgoing disk commands are augmented 
with session annotations and in the event of rejection by 
the target device, a ForcedDowngrade event is posted 
to inform the application that the corresponding lock has 
been downgraded to some weaker mode. 


Minuet lock manager (4,285 LoC): The lock man- 
ager process grants and revokes locks using the times- 
tamp mechanism of Section 3.3 and several manager 
replicas can be deployed for fault tolerance. For each 
lockable resource, the manager maintains the current 
lock mode, the list of current holders, the queue of 
blocked upgrade requests, and the largest observed 
timestamp proposal. 


SAN protocols and guard logic: To demonstrate the 
practicality of our approach, we implemented the guard 
logic and session annotations within the framework of 
iSCSI [4], a widely-used transport for IP-based SANs, 
and our prototype extends an existing software-based 
implementation of the iSCSI standard. On application 
client nodes, we modified the top and the bottom levels of 
the 3-tier Linux SCSI driver model. The top-level driver 
(/drivers/scsi/sd.c) presents the abstraction of a generic 
block device to the kernel and converts incoming block 
requests into SCSI commands. We extended sd with a 
new ioctl call, which enables the Minuet client library to 
specify session annotations for outgoing requests and to 
collect response annotations. 

The bottom-level driver implements a TCP encapsu- 
lation of SCSI and our current prototype builds on the 
Open-iSCSI Initiator driver [6] v2.0-869.2. We used the 
additional header segment (AHS) feature of iSCSI to at- 
tach Minuet annotations to command PDUs and defined 
a new AHS type for this purpose.

Our storage backend is based on the iSCSI Enterprise 
Target driver [5] v0.4.16, which exposes a local block 
device to remote initiators via iSCSI. We extended it 
with the guard logic, which examines incoming PDUs 
and makes an accept/reject decision based on the anno- 
tation. Command rejection is signaled to the initiator via 
the REJECT PDU defined by the iSCSI standard. 

The addition of guard logic represents the most sub- 
stantial extension to the SAN protocol stack, but incurs 
only a modest increase in the overall complexity. The ini- 
tial implementation of the Enterprise Target driver con- 
tained 14,341 lines of code and augmenting it with Min- 
uet guard logic required adding 348 lines. 


4.2 Sample applications 


Distributed chunkmap (342 LoC): Our first applica- 
tion implements a read-modify-write operation on a dis- 
tributed data structure comprised of a set of fixed-length 
data chunks. It mimics atomic mutations to a distributed
chunkmap - a common scenario in clustered middleware 
such as filesystems and databases. The chunkmap could 
represent a bitmap of free disk blocks, an array of i-node 
structures, or an array of directory file slots. In each it- 
eration, the application selects a random chunk, reads it 
from shared disk, modifies a random chunk region, and 
writes it back to disk. To ensure update atomicity, the 
application acquires an Excl lock on the respective block 
from Minuet prior to reading it from disk and releases 
the lock after writing back the updated version. 


Distributed B-Tree (3,345 LoC): To demonstrate the 
feasibility of serializable transactions, we implemented a 
distributed B-link tree [32] (a variant of B+ tree) on top 
of Minuet. Our implementation provides transactional 
insert, delete, update, and search operations based on 
the protocol presented in Section 3.4.2. For each oper- 
ation, the application initiates a transaction and fetches 
the chain of tree blocks necessary for the operation 
(Read phase). Next, it upgrades the locks on the mod- 
ified blocks to Excl mode and logs the updates (Up- 
date phase). Lastly, the client Prepares and Commits the 
transaction. If a transaction aborts due to loss of session 
to a tree block or the client’s log, the application reac- 
quires the corresponding lock and retries (without back- 
off) until it commits successfully. For efficiency, clients 
retain Shared locks (and the content of cache buffers) 
across transactions and stale cache entries are detected 
and invalidated during the Prepare phase. 


5 Evaluation 


In this section, we evaluate the performance of our appli- 
cations under different modes of locking. Due to space 
constraints, we present only key results that demonstrate 
the benefits of optimistic coordination enabled by Minuet 
and confirm the feasibility of our design. Several addi- 
tional important measurements are reported in [22]. 


5.1 Experimental setup 


For our experiments, we emulated a 39-node SAN en- 
vironment interconnected via 100Mbps links using Em- 
ulab [41] and detailed hardware specifications are pro- 
vided in Figure 4. Three of the nodes were configured to 
serve as Minuet lock managers and four additional nodes 
were used to emulate SAN-attached target devices, col- 
lectively providing 2GB of logical disk space, equally 
striped across the nodes. The remaining 32 nodes were 
configured as application clients. We ran each iteration 
of the experiment for 5 minutes and all of the values re- 
ported below are averages over 3 iterations. 

Table 1: Chunkmap application goodput (in operations per
second) under the uniform workload. Rows in the original
table: strong(1), strong(2), and weak-own coordination.

The number of storage targets |   1   |   2   |   3   |   4
(row label not recoverable)   | 105.9 | 220.9 | 331.1 | 410.6
Other surviving cell values: 4124, 411.7

In each iteration, we measure the aggregate goodput
(the number of successful application-level operations
per second) from all nodes and the rate of disk command
rejection under the following locking configurations: 
strong(x): We deploy a total of 2x - 1 lock managers and
require clients to obtain permissions from a majority (x).
Note that strong(1) represents a traditional locking
protocol with a single central lock manager, while
strong(2) requires 3 lock manager replicas and masks one
failure.

weak-own: An extreme form of weakly-consistent locking.
Each client obtains permissions only from the local lock
manager (co-located on the same machine) and does not
attempt to coordinate with the other clients.
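The replica counts implied by strong(x) follow from simple arithmetic, worked out in the short sketch below. The function name and the returned field names are hypothetical; the number of masked failures is derived here as the replicas left over after a majority, which matches the strong(2) example in the text.

```python
def strong_config(x: int) -> dict:
    """Replica count, voter-set size, and masked failures for strong(x)."""
    if x < 1:
        raise ValueError("x must be at least 1")
    replicas = 2 * x - 1          # total lock managers deployed
    return {
        "replicas": replicas,
        "voter_set_size": x,      # a majority of the replicas
        "failures_masked": replicas - x,
    }
```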

In all of our experiments, applications rely on Minuet 
to provide both modes of locking and do not make use of 
any other synchronization facilities. 


5.2 Distributed chunkmap 


In this experiment, we configured the chunk size to 8KB 
(for a total of 250K chunks) and ran the chunkmap ap- 
plication with 32 clients, varying the number of storage 
targets from 1 to 4. We considered two forms of work- 
load: 

uniform: In each operation, a chunk to be modified is 
selected uniformly at random. 

hotspot(x): x% of operations touch a hotspot region of 
the chunkmap constituting 0.1% of the entire dataset. 
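A chunk selector for the two workloads could be sketched as follows. The placement of the hotspot at the start of the chunk index space, and the choice to draw non-hotspot operations uniformly from the remaining chunks, are assumptions; the text fixes only the chunk count and the hotspot size.

```python
import random

NUM_CHUNKS = 250_000                 # 250K chunks, as configured above
HOTSPOT_CHUNKS = NUM_CHUNKS // 1000  # 0.1% of the dataset


def pick_uniform(rng: random.Random) -> int:
    """uniform: every chunk is equally likely."""
    return rng.randrange(NUM_CHUNKS)


def pick_hotspot(rng: random.Random, x: float) -> int:
    """hotspot(x): x% of operations land in the hotspot region."""
    if rng.random() * 100 < x:
        return rng.randrange(HOTSPOT_CHUNKS)
    return rng.randrange(HOTSPOT_CHUNKS, NUM_CHUNKS)
```

Passing a seeded `random.Random` instance keeps runs reproducible across clients.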

Table 1 reports the aggregate goodput under the uniform
workload, which represents a low-contention scenario. The
goodput exhibits linear scaling with the number of
storage servers. Further, there is no measurable
difference in performance between the three locking con- 
figurations. These results suggest that the optimistic 
method of coordination enabled by Minuet does not ad- 
versely affect application performance, while providing 
safety, in scenarios where the overall I/O load is high, but 
contention for a single resource is relatively rare. 

The rate of I/O rejection increases when the workload 
has hotspots and, as expected, weak-own suffers a per- 
formance hit proportional to the popularity of the hotspot 
(Figure 5). We note that the hotspot workload represents 
a very stressful case (the hotspot size is 0.1%) and our 
results demonstrate that weakly-consistent locking de- 
grades gracefully and can still provide reasonable per- 
formance in such scenarios. 

We also ran experiments in a partitioned network
scenario, where each client can communicate with only a
subset of replicas. A strongly-consistent locking protocol
demands a well-connected primary component containing at
least a majority of manager replicas - a condition that
our partitioned scenario fails to satisfy. As a result, no
client can make progress with traditional strong locking
and the overall application goodput is zero. In contrast,
under Minuet’s weak locking, clients can still make good
progress. This demonstrates the availability benefits that
Minuet gains over a traditional DLM design.

Figure 4: Hardware specifications of the

Figure 6: Pre-populated B+ tree parameters.


5.3 Distributed B+ tree 


The B+ tree application demonstrates Minuet’s support 
for serializable transactions. In this experiment, we start
with a pre-populated tree and run the application for 5 
minutes on 32 client nodes. Each client inserts a series 
of randomly-generated keys and we measure the aggre- 
gate goodput, defined as the total rate of successful inser- 
tions per second from all clients. To test Minuet’s behav- 
ior under different transaction complexity and contention 
scenarios, we used two different pre-populated B+ trees, 
whose parameters are given in Figure 6. 

Figure 7(left) compares the performance of strong(1) 
and weak-own. Under both locking schemes, the 
throughput exhibits near-linear scaling with the num- 
ber of storage targets. As expected, tree-large demon- 
strates a lower aggregate transaction rate because each 
transaction requires accessing a longer chain of nodes. 
Moreover, since the number of leaf nodes is large, read- 
write or write-write contention is relatively infrequent 
and hence, the performance penalty due to I/O rejection 
incurred under weak-own is negligible. By contrast, tree- 
small represents a high-contention workload and our re- 
sults suggest that even in this stressful scenario, Minuet’s 
weak locking incurs only a modest performance penalty. 

Further investigation revealed that the primary cause of
the performance degradation was an outdated estimate of
maximum timestamps (maxT_s, maxT_x), causing some of the
commands to carry outdated session identifiers (e.g., with
verifySID.T_x < ownerSID.T_x). Under weak-own, clients
select session identifiers without coordinating with other
clients and hence, a client may not know the up-to-date
value of ownerSID.T_x that may have been set by an earlier
transaction from another client.

Figure 5: Left: chunkmap goodput under the hotspot(x)
workload for varying x. Right: the rate of rejected I/O
requests under hotspot(x) for varying x.

Figure 7: Left: B+ tree application goodput with
tree-small and tree-large datasets. Right: effects of the
timestamp broadcast optimization with tree-small dataset.

A simple optimization alleviating this issue is to let 
clients lazily synchronize their knowledge of maximum 
timestamps. More specifically, each client can broadcast 
its local updates on maxT_s and maxT_x to other clients at
some fixed broadcast rate (b) and other clients can update 
their local estimates accordingly. We implemented this 
optimization and measured its effects on the tree-small 
workload with 4 storage targets. Figure 7(right) shows 
the results, which suggest that we can substantially re- 
duce the rate of rejection by broadcasting with b > 0.2 
and the resulting goodput closely approaches the maxi- 
mum value achievable under strong locking. 
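The merge step behind this optimization amounts to taking an element-wise maximum. The sketch below is illustrative only: the `TimestampEstimate` class and `broadcast` helper are hypothetical, the pair of estimates is modeled as a plain 2-tuple, and the timer that would invoke `broadcast` at rate b is left out.

```python
class TimestampEstimate:
    """A client's local estimate of the two maximum timestamps."""

    def __init__(self, values=(0, 0)):
        self.values = tuple(values)

    def merge(self, remote):
        # Estimates only move forward: keep the larger value per component.
        self.values = tuple(max(a, b) for a, b in zip(self.values, remote))


def broadcast(sender, peers):
    """Push the sender's current estimate to every peer."""
    for peer in peers:
        peer.merge(sender.values)
```

Because merging never lowers an estimate, a lost or delayed broadcast can only leave a peer with a stale value, which affects the rejection rate but not safety.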

Note that this optimization affects only performance 
and is not required for safety. Conceptually, the broad- 
cast rate b provides a way of parameterizing the contin- 
uum between traditional locking and the fully optimistic 
case of weak-own and other methods may be possible. 


6 Discussion 


In this section, we discuss several issues pertaining to the 
practical feasibility of our approach and the implications 
of Minuet’s programming model. 

Storage target modifications: Our approach rests on 
the basic idea of extending SAN-attached storage targets 
with a small amount of guard logic that enables them to 
detect and filter out inconsistent I/O requests, which will
require storage array vendors to introduce a new feature 
into their products. 

We acknowledge that Minuet relies on functionality
that does not presently exist in standard storage hard- 
ware and, consequently, faces non-trivial barriers to stan- 
dardization, implementation, and deployment. However, 
we observe that the proposed extensions are very incre- 
mental and can easily be retrofitted into an existing de- 
sign. The guard logic is amenable to efficient implemen- 
tation in hardware or firmware, requiring only a few table 
lookups and comparison operations. 

As we argue above, the benefits of implementing such 
an extension can be substantial. In addition to lifting 
the safety and liveness limitations that have traditionally 
characterized shared-disk applications and middleware, 
our approach establishes a new degree of freedom in the 
design space of SAN applications, enabling a choice be- 
tween optimism and strict coordination. 

Our investigation builds upon earlier work on device 
locking [12], which has demonstrated the practical fea- 
sibility of this approach and the willingness of storage 
hardware vendors to adopt a promising new feature [2]. 
Metadata storage overhead: In our prototype imple- 
mentation, target storage devices maintain 16 bytes of 
per-resource metadata. For a typical middleware service 
such as a database or a filesystem, a resource would cor- 
respond to a single fixed-length block containing appli- 
cation data or metadata and taking a clustered filesystem 
as an example, block sizes in the range 128KB to 1MB are
common [8]. Assuming 128KB application block size, 
the table of Minuet session identifiers for a dataset of 
size 1TB would consume an additional 128MB. 
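The 128MB figure can be checked with a short calculation. The sketch assumes binary units (1TB = 2^40 bytes, 128KB = 2^17 bytes) and one resource per application block; the function name is hypothetical.

```python
KB, MB, TB = 2**10, 2**20, 2**40


def session_table_bytes(dataset_bytes: int, block_bytes: int,
                        per_resource_bytes: int = 16) -> int:
    """Size of the session-identifier table for a dataset."""
    resources = dataset_bytes // block_bytes  # one resource per block
    return resources * per_resource_bytes
```

For a 1TB dataset this gives 2^23 resources and 128MB of metadata at 128KB blocks, and 16MB at the 1MB end of the block-size range.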

Perhaps more alarmingly, Minuet metadata must be 
stored in random-access memory for efficient lookup on 
the data path. We envision the use of flash memory or 
battery-backed RAM for this purpose and observe that 
today, high-performance storage arrays make extensive 
use of NVRAM for asynchronous write caching [39]. 
Alternatively, the session state can be stored persistently 
on disk and a fixed-size NVRAM buffer can be used as a 
cache, providing efficient access to the working set. 
Protocol extensions: Our approach requires augment- 
ing the format of READ and WRITE commands with 
session annotations and our prototype implementation 
extends the iSCSI protocol with a new AHS type for this 
purpose. A transport-level modification simplified our 
software implementation, but would be difficult to de- 
ploy in a production environment, since it would require 
modifying the HBAs. For a more easily-deployable so- 
lution, the required set of extensions can be implemented 
in a transport-independent manner at the SCSI command 
level. One option would be to use an extended command 
descriptor block (XCDB), as defined in SPC-4 ([13],
section 4.3.4), and introduce a new descriptor extension
type for carrying the session annotation. Likewise,
command rejection can be signaled to the initiator via a
CHECK CONDITION status code with a new additional sense
code and the response annotation can be communicated as
fixed-format sense data ([13], section 4.5.3).
Programming model: Another concern is that Minuet 
imposes a different programming model, exposing appli- 
cation developers to additional exception cases that do 
not naturally arise under strong locking. When a tradi- 
tional DLM service grants a lock to an application pro- 
cess, the lock is assumed to be valid and the client can 
proceed to accessing the disk without worrying about 
conflicting access from other clients. In contrast, Minuet 
gives out locks in a more permissive manner, but pro- 
vides machinery for detecting and resolving conflicting 
access at the storage device. As a result, applications 
that rely on Minuet for concurrency control must be pro- 
grammed with the assumption that any I/O request can 
fail with EBADSESSION due to inconsistent lock state. 
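In application code, this assumption typically turns into a retry loop around each locked I/O sequence. The sketch below shows one possible shape for a read-modify-write operation; the client interface (`lock_excl`, `unlock`, `read`, `write`) and the `BadSession` exception are hypothetical stand-ins and not the Minuet library API.

```python
class BadSession(Exception):
    """Stand-in for an I/O request rejected with EBADSESSION."""


def read_modify_write(client, chunk_id, mutate, max_attempts=5):
    """Apply `mutate` to one chunk; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        client.lock_excl(chunk_id)
        try:
            data = client.read(chunk_id)
            client.write(chunk_id, mutate(data))
            return attempt
        except BadSession:
            pass  # lock state was stale; reacquire the lock and retry
        finally:
            client.unlock(chunk_id)
    raise RuntimeError("gave up after repeated EBADSESSION rejections")
```

The lock is released on every path, so a rejected attempt starts over from lock acquisition with fresh session state.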

We observe that while I/O rejection does not occur 
under conservative locking, the protocols employed by 
traditional DLMs for ensuring system-wide consistency 
of locking state inevitably expose application developers 
to analogous exception cases. For instance, a network 
connectivity problem causing some application node to 
lose connectivity to a majority of lock managers would 
typically cause that node to observe a DLM-related ex- 
ception event. More concretely, the application process 
would be informed that due to lack of connectivity, some 
of its locks may no longer be valid - these are precisely 
the semantics of Minuet’s ForcedDowngrade notifica- 
tion. Hence, both models demand exception-handling for 
dealing with forced lock revocation. 

With Minuet, a node that finds itself partitioned from 
the rest of the cluster need not immediately give up its 
locks and instead, can perform a more granular recov- 
ery action. For example, it can switch to the optimistic 
method and resume disk access without coordinating 
with other application processes and this would permit 
it to make progress in the absence of conflicting access. 

Our experience with developing and deploying sam- 
ple applications on top of Minuet suggests that the avail- 
ability benefits enabled by the use of such fine-grained 
recovery actions are certainly worth the extra implemen- 
tation effort, which we believe to be relatively small. The 
chunkmap application was initially implemented on top 
of conventional locking using 327 lines of C code and 
extending the implementation to operate on top of Min- 
uet required adding only 15 lines of code to handle the 
EBADSESSION notifications. 


7 Related Work 


Concurrency control has been extensively studied in the 
operating systems, distributed systems, and database 
communities. VMS [40] was among the first widely- 
available platforms to provide application developers 
with the abstraction of a general-purpose distributed lock
manager and today, DLMs are generally viewed as a use- 
ful building block for distributed applications. 

Clustered and distributed filesystems [8, 10, 35-37] 
and relational databases [9] rely on locking or lease- 
based mechanisms to coordinate access to shared appli- 
cation state. Both sets of mechanisms make certain as- 
sumptions about timing, such as partially-synchronized 
clocks and bounded communication latency, in order to 
operate safely. These systems can directly leverage Min- 
uet to ensure safe coordination of concurrent access to 
shared data on disk without assuming synchrony. 

In web service data centers, distributed coordination 
services such as Chubby [20] and Zookeeper [15] have 
also become popular. These services are intended pri- 
marily for coarse-grained synchronization - a typical use 
case might be to elect a master among a set of candi- 
dates. Although the intended use of Minuet is to pro- 
vide fine-grained synchronization in a shared-disk clus- 
ter, our system can also support such use cases by tran- 
sitioning to strongly-consistent locking, whereby each 
lock is acquired with a majority voter set. Unlike our 
system, Chubby provides a hierarchical namespace and 
the ability to store small pieces of data, but these fea- 
tures are largely orthogonal to our approach. Chubby’s 
lock sequencer mechanism allows servers to detect out- 
of-order requests submitted under an outdated lock and 
our timestamp-based sessions generalize this idea to sup- 
port shared-exclusive locking. We also develop this no- 
tion further and observe that once we have the ability to 
reject inconsistent requests at the destination, very little 
is gained by enforcing strong consistency on replicated 
locking state and specifically, the use of an agreement 
protocol (e.g., Paxos [31]) may be more than necessary. 

Concurrency control and transaction mechanisms have 
been extensively studied in databases. ARIES [33] is 
a state-of-the-art transaction recovery algorithm for a 
centralized database, supporting fine-granularity lock- 
ing and partial rollbacks of transactions, while D- 
ARIES [38] extends this work to be usable in distributed 
shared-disk databases. Implementing these mechanisms 
on top of Minuet’s locking and I/O facilities would en- 
sure that they retain their safety properties in the face 
of arbitrary asynchrony. Minuet’s basic transaction ser- 
vice presented in Section 3.4 is a variation of timestamp- 
based concurrency control - a standard and well-known 
technique in relational database design. Finally, database 
researchers have explored hybrid approaches to concur- 
rency control [34] that enable tradeoffs between opti- 
mism and strict coordination and our work enables sim- 
ilar tradeoffs for general SAN applications, where the 
data resides on application-agnostic block devices. 

There have been several research projects tackling the
intelligence/information gap between operating systems
and storage systems [17, 25, 28]. These projects aim to
achieve more expressive storage interfaces by exposing 
more information or adding more intelligence to storage 
devices. In our work, we identified and tackled safety 
problems in SANs by narrowing the intelligence gap be- 
tween clustered applications and SAN storage devices. 

Several earlier projects have investigated new ap- 
proaches to concurrency control via functional exten- 
sions to storage devices. [29] proposes Dlocks as a new 
primitive for distributed mutual exclusion, whereby the 
lock acquisition state is maintained by the target devices 
themselves and manipulated by the initiators using a new 
SCSI command. Due to the inherent complexity of dis- 
tributed locking, the lock management functionality has 
proven too difficult to implement in a SAN storage array 
and as a result, this mechanism did not attain wide accep- 
tance among the storage device vendors. Follow-on work 
to the initial proposal presented a simplified scheme in 
the form of DMEP [12]. In this scheme, storage devices
expose an array of shared memory buffers holding the lock
state and clients manipulate this state directly using sim- 
ple atomic commands. The DMEP specification was im- 
plemented by a storage device vendor [2] and used by 
earlier versions of GFS [35]. 

Our work revisits the idea of device-assisted synchro- 
nization and is in line with these earlier efforts, but dif- 
fers in several crucial respects. First, rather than ex- 
tending the storage devices with lock management func- 
tions, we propose a more general synchronization prim- 
itive that supports a wider range of coordination tech- 
niques. In addition to "traditional" conservative locking, 
Minuet enables the use of optimistic concurrency con- 
trol, which has been shown to reduce the synchronization 
overhead and deliver better performance for certain ap- 
plication workloads. As a result, Minuet enables a new 
degree of freedom in the design space of parallel SAN 
applications, enabling the developers to safely exploit the 
tradeoffs between synchronization overhead, access con- 
flict rate, and application availability. Second, acquir- 
ing or releasing a lock in Minuet does not require ex- 
plicit communication with the target storage device and 
instead, clients annotate outgoing I/O requests with the 
relevant synchronization state. This technique addresses 
the problem of delayed requests delivered under the pro- 
tection of an outdated lock and thus enables SAN ap- 
plications to guarantee safety despite arbitrary message 
delays, drifting clocks, and node failures. Finally, unlike 
prior proposals, our design does not require new SCSI 
commands and can be implemented within the confines 
of existing protocol standards. 

Similar in spirit to this work, SCSI-3 Persistent Re- 
serve [11] tries to address the safety problems in shared- 
disk environments by extending the storage protocol and 
target devices. Revoking a suspected node’s reservation 
typically necessitates a global decision on declaring the
respective node faulty, which, in turn, requires majority 
agreement. Hence, SCSI-3 PR offers safety but not live- 
ness in the presence of network partitions and massive 
node failures, while Minuet provides both. 


8 Conclusion 


This paper investigates a novel approach to concurrency 
control in SANs. Today, clustered SAN applications co- 
ordinate access to shared state on disks using strongly- 
consistent locking protocols, which are subject to safety 
and liveness issues in the presence of asynchrony and 
failures. To solve these problems, we augment target de- 
vices with a small amount of guard logic, which enables 
us to provide a property called session isolation and a re- 
laxed model of locking which, in turn, provide a building 
block for distributed transactions. They also enable us to 
loosen the consistency requirements on distributed lock- 
ing state, thus providing high availability despite failures 
and network partitions. 

We have designed, implemented, and evaluated Min- 
uet, a DLM-like synchronization and transaction mod- 
ule for SAN applications based on the protocols we pre- 
sented. Our evaluation suggests that distributed appli- 
cations built atop Minuet enjoy good performance and 
availability, while guaranteeing safety. 